PREFACE


Elementary Statistics and Applications is designed for a begin- 
ning course. Principles of gathering and presenting statistics, 
frequency-distribution analysis, probability theory and the 
normal curve, correlation, time-series analysis, and forecasting 
are included. Elementary sampling procedure, only so far as 
it is founded upon the assumption of normal sampling dis- 
tributions, is also included. 

No attempt has been made to include any of the less con- 
ventional methods of time-series analysis. Some are too mathe- 
matical for treatment in an elementary text. Others are so 
highly specialized or so subjective as to be unsuited for textbook 
material. Many of these are new methods that need to be 
further systematized, coordinated, and tested in the crucible of 
time and experience. 

The approach in this book is that of the teacher. The authors 
have been associated in teaching statistics for more than ten 
years. The manuscript of the present text evolved during those 
years in mimeographed form, modified from year to year as new 
theories developed and as teaching use required. The sug-
gestions of students, whether consciously or unconsciously made,
have helped formulate this book. Experience has shown that 
students gain a sense of the close association of statistics to 
reality from the brief discussions of the historical origin of impor- 
tant steps in the development of statistical theory that are 
included. 

The descriptions of frequency-distribution, correlation, and 
time-series analysis are first completed in their simplest aspects, 
with elementary illustrations. This enables the student to 
visualize basic method unmixed with the more advanced phases. 
More complex illustrations of practical application are then given 
in separate chapters or separate sections. This practice elimi- 
nates the apparent digression that seems to hamper the student, 
when exposition of method and complicated illustrations are 
intermixed, as in the conventional text. In separating the two,
moreover, the fact is recognized that the best order of presenta- 
tion for teaching is not the best order of procedure for working 
an actual problem. For example, the handiest method for 
making a frequency-distribution analysis is to set up a work sheet 
with a Charlier check and first calculate the moments or the k 
statistics; but the theories of the moments and of the k statistics 
are among the most difficult parts of the analysis to explain and 
are not therefore good introductory topics for the teaching of
frequency-distribution analysis. In addition, the practical 
analysis of the frequency distribution introduces short cuts, cross 
checks, or other timesaving devices. The authors believe that 
this new arrangement will also prove to be a boon to research 
workers who may use the text as a reference book. 

The more advanced points of statistical theory pertaining to 
frequency curves and sampling analysis have been placed in a 
separate book entitled Sampling Statistics and Applications. The 
two books together constitute a set on the subject of Funda- 
mentals of the Theory of Statistics. 

In both volumes, the authors have drawn freely upon the 
many monographs and the periodical literature that have 
appeared during recent years. Care has been exercised to 
make acknowledgment in footnotes to the sources of new ideas 
that have been incorporated into the authors’ own development 
of the subject. To all these vigorous workers in the field, too 
numerous to be listed by name, the authors as well as other 
statisticians are greatly indebted. 

More particularly the authors here acknowledge a debt of 
gratitude to several generous professional colleagues who have 
read parts of the manuscript with critical and judicious eye. 
Sidney W. Wilcox, Chief Statistician of the Bureau of Labor 
Statistics in the United States Department of Labor, made 
especially helpful suggestions for Chap. XIX, Index Numbers, 
for the chapters on probability theory, and for Parts I and II 
of Elementary Statistics and Applications. John H. Smith, of
the Bureau of Labor Statistics, contributed many stimulating
criticisms and suggestions that the authors believe inspired 
important improvements. Lester S. Kellogg, Bureau of Labor
Statistics, read Chap. III, Sources of Statistics, and made sug- 
gestions that led to a constructive reworking of that material. 
The authors are profoundly grateful for such generous assistance
and wish to make full acknowledgment of their professional 
indebtedness to these men. 

The authors are grateful to the International Finance Section 
of Princeton University for the financial assistance given Acheson 
J. Duncan some years ago to enable him to study statistics and 
mathematical economics with the late Henry Schultz of the
University of Chicago and with Harold Hotelling of Columbia 
University. The authors are indebted to those men, and to 
colleagues in the Mathematics Department at Princeton Uni- 
versity. The authors are also indebted to Professor R. A. 
Fisher, also to Messrs. Oliver & Boyd, Ltd., of Edinburgh, for 
permission to reprint an abridged edition of Table III, Table 
of χ², from their book Statistical Methods for Research Workers.

Naturally, it is not to be supposed that the whole or any part 
of the manuscript carries the endorsement of the authors’ former 
teachers or those who have helped with criticisms of the manu-
script. The authors assume full responsibility for errors of
theory or calculation that may be present in the volumes. 

JAMES G. SMITH.
ACHESON J. DUNCAN.
Princeton, N. J., 
August, 1944. 
PART I


Introduction 


CHAPTER I 
STATISTICS IN THE ARTS AND SCIENCES 


From the sixteenth century to the present day, modern sciences 
have stressed empirical method—the gathering of data by labora- 
tory experiment or by statistical observation. Laboratory 
experimentation has been more spectacularly employed in the 
natural sciences (biology, chemistry, physics, botany, and the 
like), and statistical observation has been more widely used in 
the social sciences (such as politics, economics, and psychology). 
Yet laboratory technique is used for some types of investigation 
in the social sciences, especially in psychology, education, and 
agriculture; and statistical technique is frequently employed 
in the natural sciences; for example, the modern kinetic theory
of gases is a statistical argument. 

Economy and Flexibility of Statistics. Meaning of Statistics. 
Statistics and scientific method are of value wherever a mass of 
complicated facts exists and wherever those facts are amenable to 
quantitative expression. Qualitative knowledge must be con- 
verted into quantitative units of enumeration or of measurement 
before it becomes statistics. The quantitative units are either 
enumerative or measurement units. An enumerative unit 
depends upon proper definition of the objects to be counted; 
thus statistics may be compiled on the number of blue-eyed 
as compared with brown-eyed people, the number of yellow as 
compared with green beans, etc. A measurement unit depends 
upon contrivance of some unit of measurement for the purpose 
of converting qualitative knowledge into quantitative expression;
thus properly devised tests make it possible to measure intel- 
lectual aptitude on a scale so that certain quantity figures can 
be depended upon to measure relative amounts of intellectual
aptitude. 

Such quantitative description of facts makes it possible to 
give in a brief space a great amount of information. In order 
to accomplish this economy of time and space, however, it is 
of the greatest importance that the units of measurement or of 
enumeration be uniformly applied and that the nature of these 
units of measurement for observation or of enumeration be con- 
stantly kept in mind when the data are used. Furthermore, 
having always in mind the nature of the statistical units chosen 
as criteria of measurement, it is possible to arrange statistical 
data in such a manner as greatly to facilitate their interpretation. 

A large degree of flexibility is thus available when facts are 
expressed quantitatively; and, so long as the original units of 
measurement are not obscured, it is possible for specific purposes 
to arrange and rearrange a given set of data. A part of this 
flexibility is due to the fact that otherwise long, time-consum- 
ing methods of analysis can be resolved into relatively simple 
mathematical operations. These short cuts and the savings of 
human effort they make possible in the search for truth are only 
possible where knowledge can be expressed quantitatively, 
which is to say, by statistics. In using these short-cut methods, 
however, it is necessary to be ever watchful for hidden incon- 
sistencies with the original units of measurement, for it is in this 
realm that many of the misuses of statistics are found. 

Economy and a high degree of flexibility are characteristics of 
statistics that well fit them to serve a dynamic society’s needs for 
analysis and formulation of policy. It is a lesson learned from sad 
but profitable experience that statistics are something more than 
the mere will to collect facts in quantitative form. Careful study 
by many scholars has given rise to rules of procedure that must be 
followed if the economy and flexibility of which statistics are 
capable are to be realized. These rules of procedure constitute 
the science of statistics, to several aspects of which attention 
should be directed for differentiation. “Statistics” is used
broadly to refer to the whole field of the quantitative approach to 
knowledge, including the gathering of data, problems of statistical 
measurement, statistical analysis, statistical theory, and scientific 
method in general. The word “statistics” is also used to refer to 
any one of these parts of the whole subject. 
Accordingly, while “statistics” is used in the broad sense
indicated, it applies also more particularly and more accurately 
to compiled data that are systematic and quantitative expressions 
of facts or events. 

The theory of statistics is also called "statistics." The theory 
of statistics is the body of principles that has been developed, 
partly a priori by the mathematical approach and partly by 
empirical methods, to serve as a guide for sound statistics and 
sound statistical method. Understanding of the theory of sta- 
tistics is required also for compiling statistics. Statistical 
theory is required because nearly all compilations of quantitative 
facts are samples and not complete enumerations and because the 
fundamental rules regarding units of measurement must be 
obeyed in statistical enumeration if the resulting data are to be 
homogeneous, that is, comparable one with another. 

“Statistics” also refers to statistical method, a term used to 
describe the process of interpreting facts by the use of statistics 
and statistical theory. Careful study of the assembled sta- 
tistical data, obtained in a manner to secure internal compara- 
bility and arranged in well-planned tables, may be used as a 
basis for judgments or action. Further quantitative treatment, 
however, may frequently give greater significance to the sta- 
tistics. Selected summaries may bring out many relationships 
that would be difficult to visualize if they were in tables of figures 
that had been compiled for general purposes. This additional 
quantitative treatment is of the nature of classification and 
summarization. It is called “statistical analysis” and includes 
the methods of tabulation, graphs, averages, measures of varia- 
bility, correlation, index numbers, and similar quantitative 
analyses that have been developed. Judgments based on 
statistical analysis are called “statistical inferences.” Sta- 
tistical method, then, consists of two parts, (1) statistical analysis 
and (2) statistical inferences. 

In recent years the word “statistics” has also been used to 
describe figures that have been obtained by statistical analysis; 
for example, arithmetic means, average deviations, measures of
correlation, and the like, are all called “statistics,” and any one 
of them alone is called a “statistic.” 

The word “statistics” is thus used to mean all these various 
things together and any one of them separately. This may make 
for confusion, and in the above discussion such usage makes it
appear as if terms were defined by use of the term defined; but 
such is established conventional, albeit confused, use of this all- 
inclusive word "statistics." 


THREE TYPES OF DATA 


Empirical vs. Experimental Data. Answering the accusation 
that their conclusions are so vague and unpredictable as to pre- 
clude scientific sanction, the social scientists have often pleaded 
that social studies cannot, like the theories of the natural sciences, 
be tested in the laboratory. The social sciences must rely only 
on statistics and empirical or historical methods. Social theories 
can be interpreted with respect to true life only if viewed in the 
light of a ceteris paribus assumption. The assumption that 
other things are equal, or unchanged, or in balance serves the 
social scientist in the same manner as controls over experimental 
conditions serve the natural scientists. 

With the development, on the one hand, of statistical methods 
in the natural sciences and the development, on the other hand, of 
experimental methods in the social sciences, this contrast is 
becoming less real. While it is still true that social science 
predominantly uses empirical or historical data, some important 
work has been done, and more important work appears in the 
offing, with experimental data in the fields of psychology, 
sociology, education, medicine, population studies, agricultural 
economics, and statistical control of quality of manufactured 
products. Such outstanding progress in the technical develop-
ment of this experimental work has been made as to constitute
almost a special field called “design of experiments.”¹

Design of Experiments. The arrangements for making the 
experiment and for recording the data therefrom constitute the 
design of the experiment. In designing an experiment, methods 
of so controlling the experiment as to prevent biased results must 
be devised. If, for example, the experiment is to test the effects 
upon cotton culture of a certain kind of fertilizer, several areas in 
various localities may be chosen in order to test the effect of 
the selected fertilizer under a number of climatic conditions. 
The design for the experiment must then plan also some means of 
measuring these various other influences, namely, temperature 

1 FISHER, R. A., The Design of Experiments (1935).
and rainfall. Some method must also be devised for discovering,
in the resulting data, not only how much of the productivity of 
cotton is due to the fertilizer, but also how much is due to the 
differing qualities of the soil, to varying amounts of rainfall, and to 
varying levels of temperature. The design for the experiment 
must plan and organize the procedure so that from the resulting 
data it will in truth be possible to measure the net influence of the 
new fertilizer. 

Where cost is a consideration, and it seldom is not, an impor- 
tant part of the design of experiment is to decide to what extent 
to experiment, in other words, how small an experiment will give
trustworthy results. Before doing this it must be decided how 
much precision in the results, for practical purposes, is required.
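
As a rough illustration of how the required precision governs the size of an experiment, the sketch below assumes that the result sought is a sample mean, that its sampling distribution is approximately normal, and that the standard deviation of individual observations is known beforehand. The function name, the multiplier 1.96 (roughly 95 per cent confidence), and the yield figures are illustrative assumptions and are not taken from the text.

```python
import math


def required_sample_size(sigma, margin, z=1.96):
    """Smallest n for which z * sigma / sqrt(n) does not exceed the margin.

    sigma  -- assumed standard deviation of a single observation
    margin -- largest acceptable error in the estimated mean
    z      -- normal multiplier (1.96 is about 95 per cent confidence)
    """
    if sigma <= 0 or margin <= 0:
        raise ValueError("sigma and margin must be positive")
    return math.ceil((z * sigma / margin) ** 2)


# Hypothetical figures: plot yields vary with a standard deviation of 40 lb,
# and the mean yield is wanted within 10 lb either way.
n_plots = required_sample_size(sigma=40, margin=10)

# Halving the acceptable error roughly quadruples the experiment.
n_plots_tighter = required_sample_size(sigma=40, margin=5)

print(n_plots, n_plots_tighter)
```

The point of the sketch is the trade-off rather than the particular numbers: because precision improves only with the square root of the number of observations, each further gain in precision costs disproportionately more.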

The solution of some of the problems relating to design of 
experiment may be found by applying the theory of statistics. 
The solution to others is a matter of common sense, which some- 
times is more difficult to apply than might be supposed. 

Not only in such a case as testing the use of fertilizer, but in 
many problems, the researcher finds that a number of factors 
influence a given result. In agricultural phenomena, weather, 
climate, and other natural and human factors are present; in 
medicine, age, sex, and other conditions affect the application of 
treatment; in biochemical and in psychological experimentation, 
many human and natural variables enter. When it is necessary 
for a given purpose of analysis to isolate one of several influences, 
the data can be so selected or the treatments so applied as to hold 
other influences constant. For example, if age and sex as well as 
inoculation affect the outcome of pneumonia cases, the inocula- 
tion can be tested by comparing inoculated and noninoculated for 
those of the same sex or age group. It has become the practice 
to call the noninoculated group the “control” in the experiment.¹
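
The comparison described above can be sketched as follows. The records are invented solely for illustration, and the function name and the layout of the data are assumptions; the only idea drawn from the text is that inoculated and control cases are compared within the same age and sex group, so that those influences are held constant.

```python
from collections import defaultdict

# Hypothetical pneumonia cases: (age group, sex, inoculated?, recovered?)
cases = (
    [("under 40", "M", True, True)] * 9
    + [("under 40", "M", True, False)] * 1
    + [("under 40", "M", False, True)] * 7
    + [("under 40", "M", False, False)] * 3
    + [("40 and over", "M", True, True)] * 6
    + [("40 and over", "M", True, False)] * 4
    + [("40 and over", "M", False, True)] * 4
    + [("40 and over", "M", False, False)] * 6
)


def recovery_rates_by_stratum(records):
    """Recovery rate of inoculated and control cases within each age-sex group."""
    # counts[stratum][inoculated] = [number recovered, number of cases]
    counts = defaultdict(lambda: {True: [0, 0], False: [0, 0]})
    for age, sex, inoculated, recovered in records:
        cell = counts[(age, sex)][inoculated]
        cell[0] += int(recovered)
        cell[1] += 1

    rates = {}
    for stratum, groups in counts.items():
        # A stratum lacking either group offers no comparison and is skipped.
        if groups[True][1] == 0 or groups[False][1] == 0:
            continue
        rates[stratum] = {
            "inoculated": groups[True][0] / groups[True][1],
            "control": groups[False][0] / groups[False][1],
        }
    return rates


rates = recovery_rates_by_stratum(cases)
for stratum, r in rates.items():
    print(stratum, r)
```

Pooling all the cases would mix the effect of age with the effect of inoculation; comparing within each group avoids that confusion, at the price of fewer cases in each comparison.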

Hypothetico-observational Data. In addition to empirical and 
experimental data scientists make extensive use of a third type, 
namely, hypothetico-observational data.² For example, in the
physical sciences, that the moon is about 240,000 miles from the 
earth is a hypothetico-observational datum—no one has carried 


1 Cf. HILL, A. BRADFORD, Principles of Medical Statistics (1939), pp. 4-8
and 170-178.

2 EDDINGTON, SIR ARTHUR, The Philosophy of Physical Science (1939),
pp. 12-14.
out the experiment of measuring the distance from the earth to
the moon; yet on the basis of certain hypotheses it is measured to 
a comparatively high degree of precision. In the social sciences, 
index numbers purporting to measure such items as the general 
level of prices are hypothetico-observational data. In both 
illustrations, upon the basis of certain hypotheses or theories, 
practical methods are devised for estimating the measurement in 
question. In appraising the resulting estimate, the precision of 
the underlying theory or hypothesis is of primary importance. 


SERVING THE ARTS AND SCIENCES 


Statistics and the Social Sciences and Arts. Politics. Public 
opinion, the opinion of the masses, can be ascertained at any time 
on a wide variety of social and political issues by means of
statistical data collected by random questionnaires from a com- 
paratively small number of people. The employment of sta- 
tistical technique for this purpose has stirred the imagination 
and stimulated the ingenuity of students of the social and 
governmental processes. The widespread demand for such 
information and the relatively low cost of obtaining it by the 
sampling method have also gratified the acquisitiveness and 
lined the purses of a number of enterprising polling agents. 
Increasingly, political strategists appear to pay attention to these 
systematic statistical studies of public opinion. Both the major 
political parties in the United States have had expert statisticians 
engaged during the quadrennial presidential campaigns to keep 
their fingers on the pulse of public opinion. 

It has been claimed that “sampling referenda make the mass 
articulate, define the mandate of our leaders, reveal the true 
popular strength of pressure groups, and show social taboos 
quantitatively for what they are worth, . . .” that they are, in
the language of journalism, “the fourth dimension for the Fourth
Estate.”¹

Governmental Administration. Statistics are extensively used 
as guides to various kinds of governmental administration, such 
as sanitation, hospitalization, highway supervision, and public
industrial accident and compensation insurance laws. For exam- 
ple, on the assumption that industrial accidents are due to unsafe 


1 GALLUP, GEORGE, “Government and Sampling Referendum,” Journal
of the American Statistical Association, Vol. 33 (1938), pp. 131-142. 
prevent repetition of the same or similar types of accident,
statistical data on causes of accident have been assembled. 
Study of these data enables the statistician to identify and select 
the unsafe elements in transportation conditions and then to 
present the data to safety engineers for guidance in accident 
prevention.¹

From its beginning in 1790 to the present day, the Federal 
government has considered statistics on foreign trade so impor- 
tant that an organization has been maintained for the express 
purpose of assembling such statistics. In the early years of the 
republic they were gathered by the Treasury Department, but 
now they are collected and published regularly by the Depart- 
ment of Commerce. With the rapid development of large-scale 
business organization in the latter half of the nineteenth century, 
public policy with respect to social and economic conditions has
required the Federal government to maintain a Bureau of Labor 
Statistics which has been engaged principally in the task of col- 
lecting and publishing statistics on prices, cost of living, and 
wages and, in more recent years, on employment and pay rolls in 
manufacturing industries of the United States. 

It is a matter of common knowledge to all who read newspapers 
that important laws are passed by city, state, and Federal gov- 
ernments on the basis of statistical facts assembled regularly 
or collected by special legislative committees. For example, the 
Federal Reserve System of banks in the United States was created 
in 1913 after a thorough study, involving extensive use of sta- 
tistics, of the banking situation in this and other countries; 
legislation in the decade of the 1930’s on public works, relief work,
and social security was largely based on studies of a statistical 
nature. 

Business Administration. Statistics are valuable in business 
administration, enabling the manufacturing executive to obtain
more or less satisfactory answers to such a perplexing question as: 
Making allowance for seasonal changes and expected prices of 
substitute goods, what will be consumer demand for the coming 
year? Some manufacturers must make estimates for a year in 
advance; others can proceed successfully with monthly estimates. 


1 KOSSORIS, M. D., “A Statistical Approach to Accident Prevention,”
Journal of the American Statistical Association, Vol. 34 (1939), p. 526. 
Retail-store executives frequently require weekly or even daily
estimates on some articles, while sellers of perishable vegetables or 
other foods may even have to make hourly estimates. 

When the manufacturing executive has a satisfactory answer 
to the above type of question, he can schedule production to 
maintain as nearly level a rate as is feasible and to keep as 
constant a labor force as possible. In some large business 
enterprises statistics are assembled daily on working capital 
position, factory expense, output, and consumer credit extended. 
Control by the executive is kept flexible and timely by a con- 
tinuous stream of statistics both on the internal state of the 
business and on external economic conditions. As one rather 
erudite businessman says, “There has been an insistence from the
very top of the organization on getting the facts, so that we might,
to apply Descartes’s picturesque phrase, ‘be clear about our
actions and walk surefootedly in this life.’ ”¹

In his determination of policies regarding prices, production, 
and employment in his own business, the enterpriser must 
make judgments based upon knowledge of the world of prices in 
which he lives. Prices he must pay for raw materials, for labor, 
for equipment and its upkeep are his guide for determining his 
own production activity and the price he can eventually obtain 
for his product. Since all or at least part of the system of prices, 
that is, the prices he pays and the prices consumers pay for 
competitive or substitute articles, is beyond his control, the 
individual producer adapts his plans to any uncontrollable condi- 
tions he finds in the market. It is by the use of statistics that the 
modern businessman comes to understand conditions to which, if 
he is to profit, he must succeed in adapting his own business. 

During recent years polling agencies have been hired by busi- 
ness executives to obtain certain types of information with 
respect to potential markets and changes in consumer tastes or 
habits. Student groups and student publications on the cam- 
puses of colleges and universities are employed by businessmen to 
make widespread use of polling techniques. It has also been 
found that a carefully conducted student poll can do more 
to make college administrators and trustees cognizant of student
attitudes toward vital campus issues than the older and less 


1 HAYFORD, F. L., “Some Uses of Statistics in Executive Control,”
Journal of the American Statistical Association, Vol. 31 (1936), pp. 31-37. 
effective means of circulating petitions. In the large university
the student poll performs many of the functions of the open 
forum in a small university or college. Similarly, merchants and 
classes in advertising can determine the efficacy of advertising 
by the extent to which students express a preference for branded 
and highly advertised cigarettes, toilet articles, school supplies, 
and items of clothing to the little or nonadvertised varieties. 
The radio programs to which students listen, the magazines to
which they subscribe, the amounts they spend for various 
budgetary items, the type of motion picture they most enjoy, 
the mileage they travel, and the means of transportation they 
prefer are typical items of information eagerly sought by adver- 
tising organizations and business firms in college and other kinds 
of markets as well.¹

In a wide variety of practical ways the statistical principle 
of sampling is used in business. For example, by the use of a 
small spectroscope, an entire trainload of pig iron can be tested. 
The spectroscopist opens the car door, fastens a wire to a sample 
pig, strikes an electric arc between this and a bar of pure iron he
carries, and observes the light in the spectroscope. The bands of 
color in the spectroscope reveal to him whether or not the amount 
of impurity in the pig is below a previously determined standard. 
By properly selecting sample pigs at random the trainload of 
metal can be tested before it is unloaded.² In a similar manner,
though perhaps with less sensational instruments than the 
spectroscope, other types of more or less homogeneous or stand- 
ardized goods, such as shipments of ores, grains, oranges, potatoes, 
or lettuce, can be tested by sampling.
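
The sampling idea in the preceding paragraph can be sketched in a few lines. Everything numerical here is an assumption made for illustration: the size of the trainload, the impurity figures, the standard of 0.60 per cent, and the sample of 50 pigs are invented, and the function name is hypothetical. The sketch shows only that a random sample of pigs, rather than the whole trainload, is examined against a previously determined standard.

```python
import random
import statistics


def inspect_by_sampling(shipment, sample_size, standard, rng):
    """Test a simple random sample of the shipment against the standard."""
    # Sampling without replacement: no pig is examined twice.
    sample = rng.sample(shipment, sample_size)
    over = sum(1 for impurity in sample if impurity > standard)
    return {
        "sample_mean": statistics.mean(sample),
        "share_over_standard": over / sample_size,
    }


rng = random.Random(1944)  # fixed seed so that the sketch is repeatable

# Hypothetical trainload: 2,000 pigs whose impurity (per cent) centers on 0.50.
shipment = [rng.gauss(0.50, 0.05) for _ in range(2000)]

report = inspect_by_sampling(shipment, sample_size=50, standard=0.60, rng=rng)
print(report)
```

The random selection matters as much as the instrument: pigs taken only from the car door or the top of the load could differ systematically from the rest, and no amount of careful measurement would correct for that.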

Education. The grading and selection of teachers have in some
instances been based upon intelligence tests, which have been
perfected by the use of statistical technique correlating test
grades with empirical results.³ The scientific use of intelligence

1 For further illustrations see, for example, W. B. Dygert, Radio as an 
Advertising Medium; H. E. Agnew and W. B. Dygert, Advertising Media; 
E. R. Walter, Effective Marketing; E. H. Schell and F. F. Gilmore, Manual 
for Executives and Foremen; and H. B. Maynard and G. J. Stegemerten, 
Operation Analysis. 

2 Harrison, G. R., Atoms in Action (1939), p. 165. Cf., on sampling for 
grading cars of iron ore, Stewart H. Holbrook, Iron Brew (1939), pp. 164-165. 

3 WEST, MICHAEL, “The Psychology of the Teacher,” Journal of Education,
March, 1939, p. 158.
tests has developed since the First World War. In 1917 a test
called the American Army Intelligence Test was given to the 
drafted soldiers. The set of questions included on the Army
Intelligence Test was based upon the cumulative experience,
comparatively limited in extent, with such tests up to that time.
The war experience with the tests proved to be a landmark in 
their development in that it constituted a major experiment 
in their use and stimulated rapid development in the principles of 
their use.¹ Subsequently, the art of constructing questions for
testing intelligence, now called “aptitude” in order to contrast the 
testing of natural ability with the mere testing of acquired ability, 
has greatly progressed. In addition to the college-entrance tests, 
which in part measure opportunity, scholastic-aptitude tests are 
used by the leading universities as a basis for selecting students. 
As a consequence, statistical data that measure not only acquired 
intelligence but also native ability, or aptitude, are being accumu-
lated. The aptitude-test rating is often called the “intelligence
quotient,” or simply I.Q.

Mental tests have most frequently been employed with the 
feeble-minded in connection with problems of detection and place- 
ment and for determining the type of training best suited to 
individual persons. Studies of criminals by the use of intelligence
tests disclose relationships between intelligence and the type 
of crime committed, but apparently a high I.Q. neither prevents 
nor stimulates crime in general. Delinquent children have been 
found to exhibit more neurotic traits than do unselected school 
children. Tests of emotional control, dishonesty, and lack of 
self-control have been found useful in forecasting incorrigibility 
among delinquent children. 

Recently a study was made in which the I.Q.’s of 214 foster 
children, all of whom were adopted before the age of twelve 
months, and of 105 control children living with their own parents 
were compared with the I.Q.'s of the foster and real parents. The 
I.Q. of the parents was supplemented by information on occupa- 
tional status and other pertinent data. Information regarding 
the true parents of adopted children was secured from placement
records. There was far greater correspondence between the

1 BRIGHAM, CARL C., “Two Studies in Mental Tests,” Psychological
Review, Psychological Monographs, Vol. 24 (1917); A Study of American
Intelligence (1923); A Study of Error (1932).
I.Q.’s of foster children and their true parents than between
the I.Q.’s of foster children and their foster parents. It was
estimated by statistical techniques that the contribution of 
heredity to individual differences in I.Q. is probably not far from 
70 to 80 per cent and that the very best environment might, how-
ever, raise I.Q. as much as 20 points, while the poorest environ-
ment might lower it as much as 20 points.¹

Sociology. Modern sociology employs statistical method 
almost to the exclusion of other methods. This may be a mis- 
taken emphasis that will be corrected by future sociologists, 
but in that discipline the twentieth-century reaction to nine- 
teenth-century abstraction was particularly great. Moreover, 
this extreme emphasis upon fact by American sociologists² can
be traced to the picturesque Lester Frank Ward, who, despite 
the abstract qualities of his writing, emphasized the statistical 
approach. A farmer, a Civil War soldier, a Federal government 
official, a lawyer, a botanist, a chief of the Division of Navigation 
and Immigration, and, finally, toward the end of his life, a pro- 
fessor of sociology at Brown University, Ward came to the study 
of sociology with a richly varied experience. Among the voices 
raised against nineteenth-century emphasis on nature and the 
neglect of humanity his was the most vigorous. So eager was the 
reading world for this new approach that some of Ward’s books 
were translated into every Continental tongue.³

Economic Theory. From Adam Smith to the present time 
economic theory has been, at least in part, an inductive science.
In Adam Smith's day there were few statistics, but he made 
extensive use of trade, price, and wage data in his analyses. In 
modern times, especially since the turn of the century, more 


1 Study by Miss Burks, described in H. E. Garrett and M. R. Schneck, 
Psychological Tests, Methods, and Results (1933), pp. 189-190. 

2 LYND, R. S., and H. M. LYND, Middletown: A Study in Contemporary
American Culture (1929); Middletown in Transition: A Study in Cultural 
Conflicts (1937). These remarkable books are modern classics in sociology 
and are based almost entirely upon observational method largely statistical 
in character. 

3 CHUGERMAN, SAMUEL, Lester F. Ward, The American Aristotle (1939). 
Because of Ward’s optimistic views it has been suggested lately that he should 
be widely read both for information and encouragement. Cf. a review of 
Chugerman’s book by Prof. Rudolph Binder in The New York Times Book 
Review, Oct. 15, 1939, p. 10. 



complete statistical data are available, and an increasingly 
important volume of statistical writings that have significance 
in economic theory has been forthcoming. The theory of allocation
of income to the owners of capital, to the laborers, and to the
enterprisers, as well as other comprehensive economic hypotheses
such as business-cycle theories and theories of money and prices,
are today being tested by careful statistical studies. In addition,
theoretical questions concerning the factors determining particular
prices are being studied by the use of statistical methods.

Mathematical Economics. In recent years the subject of 
statistics has become closely related to the mathematical approach 
to economic theory. Starting with the nineteenth-century work
of Cournot, a group of mathematical economists have attempted to
work out a purely abstract theory of economics by using the 
shorter and more precise methods of mathematical reasoning. 
The origin of this school of economists, often called the “mathematical
school of economists,” was independent of the development
of statistics. As statistical methods became more refined
and economic data more plentiful and more accurate, the mathematical
school of economists turned to statistics to derive demand
curves and supply curves from the actual statistical events of the 
market place. This development during the 1930's was one of 
the most sensational and also one of the most controversial 
contributions to economics, and it continues to be energetically
discussed in scientific meetings and journal articles by proponents 
and opponents of the methods used. Meanwhile, without waiting 
for the issue to be settled by the theoreticians, business enterprise 
and government, and notably the United States Department of 
Agriculture, are making extensive practical use of statistical 
demand and cost curves.²

Statistics and the Natural Sciences and Arts. Astronomy. 
One of the foundations of statistical theory, the method of least 


1 Cf. National Bureau of Economic Research, Studies in Income and
Wealth, Vols. 1-3. National Resources Committee, The Structure of the
American Economy (1939), Part I, Basic Characteristics. TINBERGEN, J.,
A Method and Its Application to Investment Activity (1939); Business Cycles 
in the United States of America, 1919-1932 (League of Nations Economic 
Intelligence Service, 1939). 

2 EZEKIEL, M., and L. H. BEAN, Economic Bases for Agricultural Adjustment
Act (1933); SCHULTZ, HENRY, Theory and Measurement of Demand
(1938). 
squares, was first discovered and applied in astronomy early in the
nineteenth century. The method continues to be employed in 
astronomy to trace the paths of stars, comets, planets, and other 
heavenly bodies. Modern astronomy deals with large numbers 
of observations, which become the statistical raw material for the 
science. For example, the Harvard College Observatory receives 
monthly, from nearly one hundred different observers distrib- 
uted the world over, and on report blanks containing seven 
to seven hundred observations each, an average of forty-five 
hundred observations. It has been found best not to attempt to 
analyze each observer’s work separately, but instead to depend on 
multiplicity and frequency of observations well distributed 
throughout, to obtain the best possible light curves. Over fifty 
thousand observations come to the Harvard College Observatory 
each year, and from 1911 to 1939 it collected three-quarters of a 
million observations.¹

For years the Smithsonian Institution has been using methods 
essentially statistical in nature to record measurements of the 
amount of heat received from the sun by the earth. Smithsonian 
stations in three of the most arid regions of the earth are daily 
recording the sun’s radiation. Observers in Chile, in South 
Africa, and in Western United States have been taking records. 
According to these observations, which have been made at widely 
separated stations, correlations exist between changes in solar 
radiation and temperatures on the earth. Study of these records, 
study of records of the earth’s weather as recorded in the growth 
rings of trees, and study of similar phenomena have revealed 
recurrent cycles in the weather that may be of great value in 
foretelling long-range trends in the future succession of fat and 
lean years.²

Zoology. A considerable amount of the experimental work in 
the life sciences involves such quantitative considerations as 
weights, measurements, enumerations, pointer readings of various 
kinds, comparisons, and classifications. If the results arrived at 
by experimentation are to give rise to general principles rather 


1 Cf. CAMPBELL, LEON, “The Light Curve of SS Cygni, 213843,” Annals
of Harvard College Observatory, Vol. 90, No. 3, pp. 93-162; STERNE, T. E., and
LEON CAMPBELL, “Properties of the Light Curve of SS Cygni,” ibid., Vol. 90,
No. 6, pp. 189-206.

2 So says G. R. Harrison, op. cit., pp. 290-291.
than just to meaningless and incoherent single observations, the
zoological data must be consistently assembled, uniformity of
units must be observed, and the data classified. In other words,
statistical method must be used to bring order from isolated
chaotic measurements. 

In addition to routine problems of analysis in zoology, sta- 
tistical and mathematical devices have had interesting applica- 
tions in certain special problems. For example, in 1934 Zeuner 
used a statistical study of a system of cranial angles as a basis for 
biological inferences regarding rhinoceroses; in 1930 Soergel 
emphasized the importance of statistical methods for certain 
paleontological problems, employing numerical and mathe- 
matical procedures to study footprints and from these drawing 
inferences regarding the animals that made them; and in 1912 
Ridgway attempted to put the study of faunal coloration on a 
statistical basis.¹ Paleontologists use various mathematical
and graphic means to restore missing parts in fossil animals and 
to reconstruct hypothetical intermediate stages between less and 
more specialized animals. They also use statistical methods 
to study averages and variation in characteristics of different 
age groups, rate of growth, and the like, of various animals.²

Biology. Considering the modern emphasis on statistics in the 
social sciences it is interesting to note, not only that the method of
least squares was first applied in a natural science, but also that a
second highly important statistical method was first developed in
the natural science of biology. This is the statistical measurement
of correlation, which in the 1870's was used by Sir Francis
Galton to measure the effect that characteristics of midparents
(that is, the average of the two parents) had on their children.³
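
The kind of measurement Galton sought can be illustrated with the product-moment correlation coefficient in its later standard form. The following is only a minimal sketch in Python: the heights are invented for illustration and are not Galton's data, and the function name pearson_r is hypothetical.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Product-moment correlation coefficient of two equal-length series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length, at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squares of deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant series")
    return sxy / sqrt(sxx * syy)


# Invented midparent and child heights in inches (illustration only).
midparent = [64.5, 66.0, 67.5, 68.5, 70.0, 71.5]
child = [65.8, 66.7, 67.2, 68.2, 69.0, 69.5]

r = pearson_r(midparent, child)
print(f"correlation between midparent and child height: {r:.3f}")
```

With these invented figures the coefficient comes out close to +1, which says only that tall midparents in this made-up sample tend to have tall children; it says nothing about the cause.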

Biological experimentation in the nineteenth and twentieth
centuries, involving as it does rats, guinea pigs, and the like,
makes use of procedures that combine the laboratory test with 
the assembling of statistical data and their subsequent analysis. 
In this way, the effects and incidence of various diseases and 


1 SOERGEL, W., Die Bedeutung variationsstatistischer Untersuchungen für
die Säugetier-Paläontologie, Band 63, pp. 349-450; RIDGWAY, R., Color
Standards and Color Nomenclature (1912). Also see SIMPSON, G. G., and
ANNE ROE, Quantitative Zoology (1939), pp. 24, 404-406.

2 SIMPSON and ROE, op. cit., p. 335.

3 See Chap. XIII. 
of various cures for those diseases are measured; thus also are
tested the various theories regarding the relative importance 
of hereditary factors as compared with environmental factors in 
animal life. Much of this experimental work later becomes the
basis for theories regarding human life and for theories in respect 
to the effects of human diseases and their cure. 

Some problems in biology have interesting applications to 
the homely arts of living. A recent illustration of the use of 
statistics in biology is the standardizing of liquid household 
insecticides, a matter of considerable importance to certain 
private enterprisers engaged in the business.¹ By a series of
experiments that established the sex ratio of houseflies statis- 
tically, hitherto unknown sources of variability in the effects 
of insecticides were thrown into bold relief. It was found, 
for example, that flies at ages of less than three days vary con- 
siderably in their reaction to the spray, while flies four to six 
days old exhibit a fairly constant susceptibility. It was known 
that male houseflies are markedly more susceptible to certain 
sprays than female houseflies. 

A recent book on heredity² illustrates the extent to which
biology depends upon statistical technique. Widespread interest 
in the Dionnes led biologists to calculate the probability of 
quintuplets as compared with the probability of twins. The 
probability of quintuplets is 1/41,600,000, while that of twins 
is se, In addition, the statistical method was used in an interest- 
ing way to answer the question of heredity vs. environment, 
epitomized in the highly talented musical family of Johann 
Sebastian Bach, a talent that ran through five generations. 
Were the Bachs musical because of inborn talent or because 
of the musical environment in the home? To answer this 
question the author of the above-mentioned book resorted to 
statistical technique. He obtained information from 36 outstanding
instrumental musicians, from 36 principals of the
Metropolitan Opera Company, and from 50 students of the 
Juilliard Graduate School of Music. From facts obtained by 


1 CAMPBELL, F. L., G. W. SNEDECOR, and W. A. SIMANTON, “Biostatistical
Problems Involved in the Standardization of Liquid Household Insecti- 
cides,” Journal of the American Statistical Association, Vol. 34 (1939), 
pp. 62-80. 

2 SCHEINFELD, A., You and Heredity (1939).
questioning these persons, the author concludes that their
talent is largely inherited. Many will welcome this trend toward 
basing studies of man upon statistics of human beings rather 
than upon statistics of vegetables or fruit flies. 

Medicine. Much of the statistical work in biology has 
application in the field of medicine, and interest in statistics 
on the part of the medical profession has increased. In addition, 
the medical profession has become interested in statistics on
economic and social welfare, factors of importance in the control 
of epidemics, and of certain types of disease in the modern community.¹
The practical advantages to the physician and to the
sanitarian of the development of medical statistics are very 
great. Matters that were fiercely debated two generations ago 
and concerning which only few physicians of a hundred years ago 
could form an opinion are now a regular part of the knowledge 
of a junior medical student through the study of mortality 
statistics and vital statistics.² Indeed, the medical profession
in England has recently contributed a textbook on medical 
statistics designed to acquaint medical students with the fundamentals
of statistical theory.³

Engineering. Since the success of their work depends not 
only on the machines but on the human beings who operate them, 
mechanical engineers have become increasingly interested in the 
use of statistical method for making time studies in machine 
operations. It is now realized that such studies cannot be 
safely based upon some a priori scale of the machine's capacity 
or upon the record of only one or two operatives. Rather, 
time-study data must be collected from an entire group of 
operatives so that adjustments can be made according to the 
effects upon operation of the human traits found by statistical 
study to prevail in the machine or in the manner of operation.⁴

Two simple examples of the application of statistics to electrical 
engineering are the study of elevator capacity for buildings and 

1 Cf. DAVIS, MICHAEL M., “Wanted: Research in the Economic and Social
Aspects of Medicine,” The Milbank Memorial Fund Quarterly, October, 1935, 
pp. 339-346. 

2 Cf. PEARL, RAYMOND, Introduction to Medical Biometry and Statistics
(1923), pp. 2, 38.

3 HILL, A. B., Principles of Medical Statistics (1939).

4 BERGEN, H. B., “Scientific Management in Unionized Plants,” Mechanical
Engineering, March, 1938, pp. 235-240.
telephone calls to be handled by an exchange. Statistics
regarding the number of passengers taken on at the first floor 
are used to determine the time required for passengers to leave 
the elevator, the round-trip time, and the number of passengers 
carried by a given elevator. The most desirable type of elevator 
equipment to install is determined from such data.¹

Since engineers are dealing with natural phenomena that
cannot be affected by human bias, many of their problems can 
be solved approximately by the application of the principles 
of probability. For example, during a long period of gaugings of 
a stream the frequency of floods is often the best indication 
of probable future floods. Such important engineering data 
as forecasts of future floods, low annual rainfall, and consequent 
depletion of storage reservoir can be estimated by applying 
the theory of probabilities to statistics on the past history of 
such events. From such data, the use of statistical technique
makes it possible to estimate the proper size of a hydroelectric
power plant and to predict its output and earnings.²
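
The reasoning from past gaugings to probable future floods can be sketched as follows. This is a simplified illustration in Python and not the procedure of the handbook cited: the gauge readings and the flood threshold are invented, the function names are hypothetical, and the calculation assumes that each year is independent of the others and that the past frequency persists.

```python
def exceedance_probability(annual_peaks, threshold):
    """Share of recorded years whose peak flow exceeded the threshold."""
    if not annual_peaks:
        raise ValueError("at least one year of gaugings is required")
    exceeded = sum(1 for peak in annual_peaks if peak > threshold)
    return exceeded / len(annual_peaks)


def prob_at_least_one(p_annual, years):
    """Chance of one or more floods in a span of years.

    Assumes independent years with the same annual probability.
    """
    return 1.0 - (1.0 - p_annual) ** years


# Twenty years of invented annual peak flows (arbitrary units).
peaks = [310, 450, 280, 620, 390, 510, 300, 700, 420, 360,
         480, 330, 560, 410, 290, 640, 370, 440, 350, 400]

flood_level = 600  # hypothetical flow regarded as a flood
p = exceedance_probability(peaks, flood_level)
risk_10 = prob_at_least_one(p, 10)

print(f"estimated annual flood probability: {p:.2f}")
print(f"chance of at least one flood in 10 years: {risk_10:.2f}")
```

Three of the twenty invented years exceed the threshold, so the estimated annual probability is 0.15. A record this short gives only a rough estimate, which is why the text speaks of solving such problems approximately.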

One of the most striking illustrations of the use of statistics 
in engineering is the control of the quality of manufactured 
products.³ In ordinary manufacture, with the exception of
the making of optical or other precision instruments of infinite 
refinement, all units of a product are not identical, in spite of 
the vaunted standardization of products in industry. The cost 
of so refining the machines or of so regulating their operation as 
to make all units of product identical would be prohibitive 
and in most cases unwarranted because of the low market value 
of the product. Variations in quality are thus considered to be 
justified, and it is the purpose of quality control to develop 


1 Cook, H. B., “Selecting Elevators for an Office Building,” Power, Mar. 8, 
1932, pp. 404-408.

2 CREAGER, W. P., and J. D. Justin, Hydro-electric Handbook (1927), 
pp. 43 and 171. For other illustrations of the use of statistics in engineering 
science and art, see C. W. Hubbard, “Investigation of Errors of Pitot 
Tubes,” Transactions of the American Society of Mechanical Engineers, 
August, 1939, pp. 477-506; H. K. Barrows, Water Power Engineering (1927), 
pp. 54-57. 

3 SHEWHART, W. A., Economic Control of Quality of Manufactured Product
(1931). Since Shewhart’s pioneer efforts on this important subject, much
progress has been made, so much that one might say a new craft has been
created.
statistical means of showing the actual statistics of variations
in quality, the economically permissible variations in quality, 
and the statistical measurement of ways of locating and cor- 
recting causes of quality variation beyond set limits. Such 
control is designed to reduce the number of products that 
must be discarded as below standard; consequently, if successful, 
quality control reduces waste and lowers manufacturing costs 
per unit of output. In addition, selling costs are reduced and 
good will improved, because quality control decreases the 
number of customers who become dissatisfied as the result of the 
inconvenient necessity of returning inferior products. 
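
The idea of locating quality variation "beyond set limits" can be sketched with limits placed three standard deviations on either side of the average of a baseline run. This is a simplified illustration in Python, not Shewhart's full procedure (which works with small subgroups); the measurements are invented and the function names are hypothetical.

```python
from statistics import mean, pstdev


def control_limits(baseline, k=3.0):
    """Return (lower, center, upper) limits at k standard deviations."""
    center = mean(baseline)
    sigma = pstdev(baseline)  # standard deviation of the baseline run
    return center - k * sigma, center, center + k * sigma


def out_of_control(values, lower, upper):
    """List (position, value) pairs that fall outside the limits."""
    return [(i, v) for i, v in enumerate(values) if v < lower or v > upper]


# Invented diameters (in millimeters) from a run judged satisfactory.
baseline = [10.02, 9.98, 10.01, 9.97, 10.03,
            10.00, 9.99, 10.02, 9.98, 10.00]
lower, center, upper = control_limits(baseline)

# Later units are checked against the limits set from the baseline.
new_units = [10.01, 9.96, 10.12, 10.00]
flagged = out_of_control(new_units, lower, upper)

print(f"limits: {lower:.3f} to {upper:.3f} (center {center:.3f})")
print("units calling for investigation:", flagged)
```

Only the third of the new units falls outside the limits. Units inside the limits are treated as ordinary, economically permissible variation; a unit outside them is a signal to look for an assignable cause.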

Although used in the American Telephone and Telegraph 
Company under the leadership of Dr. Shewhart, application 
of statistical quality control has been negligible in the United
States.¹ In Great Britain, however, the idea of statistical
quality control was accorded an enthusiastic reception following 
Shewhart’s visit to London in 1932. A committee headed by 
Dr. E. S. Pearson was organized by interested British indus- 
trialists to consolidate previous progress and facilitate adoption 
of the technique. By 1937 in England the methods had been 
applied to coal, coke, cotton yarns, cotton textiles, woolen 
textiles, spectacles glass, lamps, building materials, and manufactured
chemicals.³

Physics and Chemistry. There is no dispute among modern 
physicists and chemists as to the importance of statistical 
methods in their sciences. Even the highly metaphysical Sir 
Arthur Stanley Eddington in his Nature of the Physical World 
(1928) attaches great importance to statistics in the natural 


1 Two reasons have been given for this failure of statistical quality control 
to be applied in the United States: “[first,] a deep-seated conviction of 
American production engineers that their principal function is so to improve 
technical methods that no important quality variations remain, and that in 
any case the laws of chance have no proper place among modern ‘scientific’
production methods; second, . . . the difficulty of obtaining industrial 
statisticians who are adequately trained in this fairly complicated field.” 
Freeman, H. A., “Statistical Methods for Quality Control,” Mechanical 
Engineering, Vol. 59 (1937), pp. 261-262. 

2 PEARSON, E. S., The Application of Statistical Methods to Industrial
Standardization and Quality Control (British Standards Institution, London, 
1935). 

3 FREEMAN, op. cit., pp. 261-262.
sciences. In fact, he says that the laws of nature divide themselves
into three classes: (1) identical laws, such as the law of
conservation and the law of gravitation; (2) statistical laws,
such as Boyle’s law, the second law of thermodynamics, and
quantum laws; and (3) transcendental laws, which are “genuine
laws of control in the physical world.”¹

In physics, statistical technique is employed in the study of 
molecules. This modern statistical approach has a philosophical 
background that goes back at least as far as Boltzmann, who 
in 1866 expressed the second law of thermodynamics in terms of 
probabilities. His contribution was regarded as a form of 
mysticism until it was demonstrated by research during the 
first two decades of the twentieth century.² At the turn of the
twentieth century Max Planck was trying to explain why pieces 
of matter heated to high temperatures emit more light of one 
wave length than of any other and less light at both longer and
shorter wave lengths. He could not explain this phenomenon 
except by supposing that light is emitted by atoms not as con- 
tinuous trains of electromagnetic waves but in discrete bundles 
of energy that he called “quanta.” Similar experimental 
work accompanied by new theoretical contributions, notably 
those of Heisenberg, led to the formulation of the modern 
statistical approach to the natural sciences. Within three 
decades this new theory has come into widespread practical 
use also, having found application in explanation of the behavior 
of photographic plates, the conduction of electricity through 
wires, the conduction of heat through walls, the behavior of 
photoelectric cells, the manner of emission and absorption of 
light by atoms and molecules, and in the theory of metals.

As explained by a recent popularizer of the natural sciences,⁴
Newtonian mechanics succeeds in accurately predicting motion 


1 STEBBING, L. S., Philosophy and the Physicists (1937), p. 70.

2 Haas, ARTHUR, The New Physics (1923), pp. 38-44. 

3 Cf. ELDRIDGE, J. A., The Physical Basis of Things (1934), pp. 357-358.

4 DE BROGLIE, LOUIS, Matter and Light: The New Physics, translated by
W. H. Johnson (1939). For a popularized description of the experimental
development based upon Boltzmann and later Heisenberg's theories, see
also Eddington, op. cit. Cf. William M. Malisoff, review of De Broglie’s
Matter and Light, in The New York Times Book Review, Oct. 1, 1939. Also
see H. Lifschutz and O. S. Duffendack, “The Counting Losses in Geiger- 
Müller Counter Circuits and Recorders,” Physical Review, Vol. 54 (Nov. 1, 
occurring on the human scale and also on the scale of celestial
bodies. In other words, Newtonian mechanics does well for 
macroscopic measurements. But in the investigations of the 
motion of the microscopic particles inside the atom, Newtonian 
mechanics ceases to have value, while quantum mechanics makes 
it possible to grasp the meaning of new principles that must 
necessarily be introduced in these more minute analyses. The 
principles referred to are statistical in nature and are based 
in large part on the theory of probability. “It is impossible
to measure several physical quantities (as energy, position,
momentum) accurately at the same time. It is this necessary
inexactness that has forced us to find our ultimate laws in
probabilities.”¹

It must not be supposed that the new statistical approach,
which is said to have been derived from Heisenberg’s uncertainty
principle, necessarily has thrown into chaos concepts of physical
measurement. The admission that laws in quantum mechanics 
are statistical may destroy the idea that the universe is a huge 
machine; but in a given case, with the initial conditions deter- 
mined as precisely as the principle of uncertainty permits, the 
probability of all subsequent states is determined by exact 
mathematical probabilities. There is nothing lawless in quantum 
phenomena.² Analysis shows, moreover, that the theoretical
uncertainty, which prohibits a simultaneous accurate measure- 
ment of position and of velocity, is noticeable only in dealing with 
the very minute masses of the subatomic world. With ordinary
masses, the theoretical uncertainty, though still existing, falls 
below the practical uncertainties, which are due to the imperfec- 
tion of human observations, and is completely submerged by the 
latter. This gradual obliteration of the quantum uncertainties, 
as the scientific observer passes to the commonplace level of 
average masses, is the reason why Newtonian mechanics is still 
used. For the small velocities and relatively large masses with 


1938), pp. 714-725; A. Ruark, “The Time Distribution of So-called Random 
Events,” Physical Review, Vol. 56 (Dec. 1, 1939), pp. 1165-1167; E. R.
Rutherford, Radiations from Radioactive Substances (1930), Chap. VII, 
pp. 171-172. 

1 ELDRIDGE, op. cit., p. 376. Cf. TOLMAN, R. C., The Principles of Statistical
Mechanics (1938), p. 65.

2 STEBBING, op. cit., p. 183. 
which the scientist is usually concerned, the two mechanics yield
results so nearly alike that in practice no experiment would be 
sufficiently refined to detect the difference. Since the Newtonian 
mechanics is mathematically the simpler, there is every advantage 
in retaining it.¹
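
The contrast between subatomic and ordinary masses can be made concrete with a rough calculation. What follows is only an illustrative sketch: it uses the uncertainty relation in its modern textbook form and rounded present-day constants, none of which appear in the sources cited, and the two position uncertainties are chosen arbitrarily for the comparison.

```latex
% Uncertainty relation, with \Delta p = m\,\Delta v and
% \hbar \approx 1.055 \times 10^{-34}\ \mathrm{J\,s} (assumed rounded value)
\Delta x \,\Delta p \ge \frac{\hbar}{2}
\quad\Longrightarrow\quad
\Delta v \ge \frac{\hbar}{2\,m\,\Delta x}

% Electron: m \approx 9.11 \times 10^{-31}\ \mathrm{kg},
% position known to \Delta x = 10^{-10}\ \mathrm{m}
\Delta v \ge \frac{1.055 \times 10^{-34}}
                  {2\,(9.11 \times 10^{-31})(10^{-10})}
\approx 5.8 \times 10^{5}\ \mathrm{m/s}

% One-gram body: m = 10^{-3}\ \mathrm{kg},
% position known to \Delta x = 10^{-6}\ \mathrm{m}
\Delta v \ge \frac{1.055 \times 10^{-34}}
                  {2\,(10^{-3})(10^{-6})}
\approx 5.3 \times 10^{-26}\ \mathrm{m/s}
```

For the electron the unavoidable uncertainty in velocity is hundreds of kilometers per second; for the one-gram body it is far smaller than any error of observation, which is the sense in which the quantum uncertainty is submerged at the ordinary scale.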

Except for Einstein's theory of relativity there has been nothing 
so to stir the imagination of the natural scientists in the twentieth
century as this new statistical approach. In fact, one writer has 
said that the entire structure of modern physics and chemistry, 
and therefore of all the natural sciences to which they are fundamental,
rests upon quantum mechanics.²

From the above discussion it is readily apparent that statistical 
techniques are helpful, not only to theories in the natural and 
social sciences, but to the arts dependent on those sciences. Yet 
for many students the most important reason for knowing some- 
thing about the fundamentals of statistical method is the need for 
intelligent discrimination between the proper and improper use of 
statistics. Unfortunately, a large portion of the extensive
modern employment of statistics in all fields falls under the latter
heading. This is especially true in popular presentations of 
modern scientific and political matters. Too close attention to 
the mechanics of a method and the neglect of common sense are
responsible for a large number of these horrible examples. All too 
often, preoccupation with the technique dims common sense. 

Statistics and Philosophy. Nineteenth-century cocksureness
of the scientific approach, pretending to such a degree of precision 
and to such broad scope as to annihilate the foundations for 
ethical, moral, and religious faiths, has largely disappeared. 
Under the aegis of the assertive and materialistic science of the 
nineteenth century, belief in free will was dwindling to a mere 
superstition; but the element of indeterminacy brought into 
science as a result of the application of the theory of probabilities 
again permits freedom. This decline of mechanistic assurance in 
science has not been ignored by philosophic thought, which has 
emphasized as never before a lesson that has often recurred in the 
history of philosophy: objective reality is not always identical 
with subjective concepts. Eddington expresses these doubts 


1 D'ABRO, A., The Decline of Mechanism (1939), pp. 37-57.
2 HARRISON, op. cit., pp. 341-342.
3 ELDRIDGE, op. cit., pp. 379-380.
in the following words:¹ “Does the spectroscope find the colors,
or does it make them? When the late Lord Rutherford showed 
us the atomic nucleus, did he find it or did he make it? How 
much do we discover and how much do we manufacture by our 
experiments?” 

Just as surely as the railroad destroyed the supremacy of the 
stagecoach or the radio eclipsed the popularity of the phonograph, 
so have the discoveries of modern science eclipsed faith in many 
ideals and beliefs that served to give reason to the lives of the 
masses of the people. In the realm of ethical and moral values, 
buttressed by the dogma of a bygone age, nineteenth-century 
scientific method was almost wholly destructive and hardly at all 
constructive. Modern philosophy criticized scientific method, 
both the laboratory and statistical branches, for failing to provide 
new moral values to replace outmoded prescientific ones. Despite 
this gloomy aspect, philosophy’s greatest spokesmen look to 
scientific method itself to obtain the necessary enlargement 
of the conception of human nature and the formulation of the 
required new moral values. John Dewey envisages the use of 
scientific method to create a comprehensive democratic culture as 
a guarantor of genuine freedom.²


SUMMARY 


Statistical method (the quantitative expression of knowledge,
the marshaling of facts and their arrangement in a form suitable
for scrutiny) has been the means employed by businessmen,
natural scientists, and social scientists to establish bases for 
judgments regarding factual data so complex or so numerous as 
to be, in the unmarshaled state, intellectually incomprehensible. 
Commercial statistics and their interpretation may, indeed, be 
said to constitute the scientific background of business today. 
Men cannot conduct their business intelligently without them. 
Quite as important as statistics of commerce and trade are the 
more recently developed industrial and social statistics, data on
employment and pay rolls in industry, trade, and finance and on 
the distribution of income. 

In the science of government and its practical art the sta- 
tistical approach has proved itself essential as facts have accumu- 


1 Op. cit., pp. 108-109. 
2 Freedom and Culture (1939), passim.
lated and, to an increasing extent, as means have been developed
for making the quantitative units of measurement required. 
The importance of statistical techniques in the natural sciences is 
attested to by the definition of “science” so familiar to every 
schoolboy: Science is systematized knowledge. Statistical 
mechanics is essential to an understanding of modern physics 
and chemistry. Whatever the individual's station or calling, he 
is to a greater or lesser extent using statistical techniques. 


CHAPTER II 
GATHERING STATISTICS 


Before the commercial revolution of the sixteenth century, 
social and economic life was relatively simple. The small villages 
and towns were self-sufficing economic and social units. Little or
no statistical enumeration of facts was required to comprehend 
the extent of the population, the number of buildings, the number 
of cattle, and the quantity of other constituent units in the com- 
munity. Within the limited range of space and time usually 
contemplated, events having to do with the welfare or distress of 
the community were not complex. Judged by modern standards, 
government was simple and inexpensive because social and
economic relationships were not complicated. Even the great 
cities of the time were not large compared with modern metro- 
politan districts. In population, wealth, and trade, the extent 
of a sixteenth-century nation was inconsiderable and, furthermore,
was growing almost imperceptibly. In other words, conditions 
were relatively simple and static. 

Genesis of Fact Marshaling. Under such conditions, little 
was done in the way of the systematic gathering and analysis 
of statistical data. The situation did not demand the con- 
tinuous assembling of up-to-the-minute facts. Indeed, it was 
not profitable to do so. The motive did not exist in sufficient 
force to direct attention to the problem of expressing quanti- 
tatively the events of contemporary social and economic life; 
and the facts of the natural sciences were obscured in medieval 
mysticism or cherished from a forgotten age by a few scattered 
and scholarly churchmen. Nevertheless, it was found useful 
on occasions to make great surveys that could subsequently 
serve as the basis of governmental decision in regard to taxation 
and other social activities, and that might also be a guide to 
private enterprise. Pepin the Short in 758 and Charlemagne 
in 762 demanded detailed descriptions of church lands, while 
several works written in France during the first half of the ninth 
century gave a partial enumeration of the serfs attached to the
land.¹ Likewise, when William the Conqueror undertook the
reorganization of the national government of England in the 
eleventh century, he found it desirable to make his famous 
survey, which resulted in Domesday Book, completed about 
1086.² Also, as early as the fourteenth century, the medieval
guilds gathered statistics in connection with their regulation of 
markets. Later, in the fifteenth century, as the breakup of 
the medieval system gathered momentum and as the rise of 
trading groups accelerated, there was a great increase in the 
amount of statistical work done by guilds as well as by central 
governments—the latter not infrequently through guild organiza- 
tions or through the Church. Economic statistics were collected 
when the occasion demanded, for example, when the upsetting 
of a customary price by a flood or drought required explanation 
and the determination of a new customary price. Although 
there is evidence that in these several ways statistics were 
assembled, they were neither methodically made nor preserved. 
There are isolated instances of the registration of deaths or 
baptisms in the fourteenth and fifteenth centuries, but it was
not until the sixteenth century that any considerable movement
toward statistical enumeration of facts occurred.³

Development of a Dynamic Social Order. During the Renais- 
sance, from thirteenth-century Italy to fifteenth-century Spain 
and England, the quantity of data in the physical sciences 
steadily accumulated from experimental efforts of astronomers 
and other scientists. The most dramatic of all human experi- 
ments was made by the voyagers seeking to prove that the world 
is round. The discovery of America and the voyages of exploration
of the sixteenth century gave great impetus to the development
of trade and the growth of nations. Motivated by the
economic ideals of mercantilism, a period of trade development
followed, the domestic system of manufacture rapidly expanded,

1 WALKER, HELEN M., Studies in the History of Statistical Method (1929),
pp. 32-33. The History of Statistics (compiled and edited by John Koren,
1918).

2 CHEYNEY, EDWARD P., An Introduction to the Industrial and Social
History of England (1925), pp. 17-18.

3 FAURE, FERNAND, “The Development and Progress of Statistics in
France,” in Koren, History of Statistics, pp. 229-233.

4 FAULKNER, H. U., American Economic History (1931), pp. 34-57.

and the colonial empires of Portugal, Spain, France, Holland,
and then England, emerged. The change was from haphazard 
and occasional trade of the merchant adventurers of the sixteenth 
century to the more or less systematic and regular international 
and intercolonial trade of the seventeenth and eighteenth cen- 
turies. Along with this trade development came the necessity 
for obtaining more regular information concerning markets, 
population, wealth, prices, and the movements of merchandise 
and gold. Furthermore, with this growth of trade both in 
volume and in complexity, governmental and social organization 
became more complex. 

As the fact of change was revealed by the events of the com- 
mercial revolution, national governments began to feel the 
need of more regular fact finding in order to visualize and to 
interpret changing conditions. Yet it must not be supposed 
that well-organized or any considerable amount of statistical 
data for the sixteenth or even the seventeenth century can now 
be found. It was more a case of an awakening of the will to do 
rather than a case of actual accomplishment. For it was really 
the Industrial Revolution and the vigorous growth that took 
place in the eighteenth and nineteenth centuries that gave the 
actual impetus to systematic marshaling of quantitative facts. 
It was not until the early part of the nineteenth century, indeed, 
that most of the essential principles of statistical method, even 
for purely descriptive purposes, had been evolved. Also, the 
compilation and current use of statistics as practiced today 
have been made possible only by the growth in transportation 
and communication facilities, a nineteenth-century phenomenon. 
It was also from the eighteenth century onward that the achieve- 
ments of scientists in accumulating experimental data for the 
natural sciences fired the imagination of scholars to solve the 
problem of data accumulation for the sciences of life and social 
behavior. 

Quantitative Expression of Facts. Where there are large 
populations, great nations of tens of millions of people, all 
problems of social, economic, and political organization are 
increased many times in complexity and, furthermore, new 
problems arise. The problem of feeding such large populations, 
the problem of housing them, the problem of keeping them 
employed and preventing them from harming each other, to
mention but a few of the considerations confronting the governmental
administrator; all these are vastly complex owing to the
great expanse of geographical space covered and the varying
conditions at different places and times. In simple economies
many of these problems can be solved by permitting individual
freedom of choice and free economic enterprise; but as the
community becomes more and more closely knit in economic
and social relations and as various forms of economic power
emerge, individual freedom of choice and free economic enterprise
become goals that must be consciously sought by organization
rather than natural tendencies that develop unaided.¹

In the intricate social and economic organizations of the
modern era, it is inconceivable that any individual or group of 
individuals can obtain the knowledge necessary to form judg- 
ments concerning the issues that arise. An individual can 
comprehend only those conditions within a reasonable geographical
area about him; the more complicated society is, the
smaller the area about him that he can understand without the 
use of statistics. “We are overwhelmed, not only by the diver- 
sity of knowledge, but also by the diversity of possible deeds, of 
possible values, and of possible judgments,” and, further, “this 
human mind, whose needs Plato so perfectly understood, still 
insists upon constructing for itself a fixed world in the midst 
of a fluid one. It persists in thinking in terms of aims and 
ends and perfections; of ideals, of purposes, and final goods; 
and, at the very last, it insists upon assuming some direc- 
tion in change, something toward which the chain of events is 
moving.”²

In this effort it is impossible for the individual to survey the 
conditions qualitatively; it would take him many human lifetimes
to inspect the whole population, and the capacity of the
human brain is not adequate to the task of absorbing so complex 
an impression. If he attempts a microscopic survey, he is 
quickly smothered by overwhelming detail. If he attempts a 
macroscopic survey without the use of statistics, he is compelled


1 For the complexity of modern society, as it is reflected in statistics, see
publications of the United States Bureau of the Census. For suggestive
special studies, see Corrington Gill, Wasted Manpower (1939); Henry Pratt
Fairchild, People (1939).

2 Karen, J. W., Art and Experience (1932), pp. 121, 211.

to resort to guesswork and commonly originates “cloud-pushing”
fantasies. Furthermore, the individual’s personality tends
to bias him not only in his observations but also in his judgments. 
If he is temperamentally inclined to be impressed by sordid 
things, he is likely to notice them more than the good things in 
his surroundings, and his judgment is correspondingly pessi- 
mistic. On the other hand, if he is temperamentally optimistic, 
he tends to consider as the rule the good things and to regard 
as unusual the sordid things of life. Where it is necessary to 
gain knowledge concerning large populations of people and 
things, where social and economic life is complex, there is need 
to use statistics. 

Rational Basis for Gathering Data. Quixotically accumulating
data is not to be confused with scientific fact gathering. The 
progressive accumulation of useful quantitative facts has been 
stimulated and furthered by a definite, conscious purpose. 
To look at the process historically, it was the rise of nationalism 
and the mercantilistic ideal that supplied definite purpose for 
the fact-finding inquiries of the eighteenth-century political 
arithmeticians. Modern survivals of the same nationalistic 
and mercantilistic ideals impel governments to spend vast 
sums for the collection of statistics designed to measure the 
wealth and material position of the nation and to furnish business 
enterprise with facts about markets. Underlying much of this 
effort abides also a sincere interest, stimulated by scientific 
research, in real human welfare. As a consequence, the modern 


Units of Description and Measurement. The units of description
or measurement by means of which quantitative facts
such units must be strictly applied during the assembling of the
data and in all subsequent analysis. It is accordingly of the
utmost importance clearly and fully to describe units of description
and measurement in all subsequent use of the data. Such
rules are so clear and so easily resolved into matters of simple
common sense that it seems almost a waste of time to direct
attention to them; yet to follow them is not always so simple a
matter as might be supposed. 

For example, in 1940 thousands of enumerators undertook 
the task of counting the population of the United States, of 
counting the number of farms, farm animals, and all other types 
of wealth, and of obtaining specified information concerning 
every person living in the United States. One may ask: Why 
should mere counting be a complicated task? This question 
would be quickly answered by the farmer’s boy who has just 
finished trying to count the number of chickens in his pens. 
Everything would be easy if they would only stand still. People,
as well as chickens, do not stand still while they are being
counted, and simple matters mount up into a veritable host of 
intricate difficulties. Suppose you were an enumerator and 
in the first house approached you found that the mother is in 
the maternity hospital, a baby was born at 10 A.M. of the census 
day, one son is away at college in another state, a daughter is 
boarding and rooming in a neighboring town, where she teaches 
school, and the father is in jail for evading income taxes. On 
several points you would feel the need for very specific instruc- 
tions. To avoid double counting or the failure to count many 
individuals, instructions to the enumerators must be given with 
great care; every possible complication must be foreseen. 

In recording facts about manufacturing and trading, or 
merchandising, enterprises in separate categories, when is an 
enterprise a manufacturing concern and when is it a trading, 
or merchandising, concern? In recording statistics about farms, 
when is a farm not a farm but a truck garden? These few 
examples are probably sufficient to emphasize the point that the 
unit must be carefully defined and that the defined unit must be 
strictly followed and freely or even religiously disclosed to all 
who in the future use the statistics. 

Carefully planned schedules of questions, often called “questionnaires,”
are the principal means of gathering statistics.
These vary from schedules simple enough for oral presentation, 
as frequently utilized in polls, to the elaborate forms used by the 
government or research organizations. In the first phase of 
statistical investigation, the gathering of facts, care in following 
all the rules of common sense and logical definition is epitomized 
in the formulation of the questionnaire, or schedule.

QUESTIONNAIRES OR SCHEDULES


Official Example of Care in Description of Units. In taking the 
United States census, for example, the assurance of accuracy in 
regard to these important but detailed matters is guarded by a 
skillfully organized system. Forms are supplied, with column 
arrangement for writing in all the required information. A 
question appears at the head of each column, and the columns,
and therefore the questions, are grouped into subjects; thus in the 
schedule for the 1940 population census there are 34 columns 
grouped under the subjects location, household data, name, rela- 
tion, personal description, education, place of birth, citizenship, 
residence, and employment status. In addition, columns 35 to 50 
contain supplementary questions to be asked only a sample of all 
the persons enumerated. Figures 1 to 3 show in three sections 
the 34 questions asked of all persons. Figure 4 is included to 
show the questions on employment status in the 1930 census, 
revealing thereby the great elaboration of this type of question in 
the 1940 census. 

Sample forms that had been filled in with illustrative answers 
were supplied to enumerators, and a complete, simple description 
of the manner in which the form was to be filled out was printed on
the sample schedule. Pamphlets were issued for the use of
enumerators, giving detailed instructions. For the 1940 census
there were issued to enumerators taking the census of population 
and agriculture a printed and indexed pamphlet 173 pages long.
This gave detailed definitions and described procedure for enumerators
to follow under the various circumstances that might
arise in their house-to-house canvasses. 

Moreover, the enumerators worked directly under experienced
district supervisors, who, in turn, were under area managers 
responsible to the Bureau of the Census in Washington. To train 
the 529 district supervisors in the 1940 census taking and census 
procedures, a picked group of more than one hundred men from all 
parts of the country were given a special course of instructions in 
Washington. Those who passed the examination were sent out as 
area managers to the 104 census areas, each to direct the training 
of five to seven district supervisors and to act as regional manager 
between them and the Bureau of Census in Washington. 

The 529 districts were broken into enumeration districts of 
which there were about 147,000. Generally speaking, there was
one enumerator in each of these districts, but in certain regions
one enumerator covered more than one district. Therefore,
about 123,000 enumerators were used. Wide publicity for such
careful preparation in the case of the 1940 census resulted from
Congressional protests about some of the new questions.¹

To illustrate the necessity for careful definition of units and 
description of procedure and to solve the census problem of the 
amazing family described above, the following is quoted from the 
Instructions to Enumerators:²


Who Is to Be Enumerated in Your District


300. The problem of who is to be enumerated in your district is 
extremely important. Therefore, study very carefully the following 
rules and instructions. 

301. The Census Day. There should be a return on the population 
schedule for each person alive at the beginning of the census day, i.e., 
12:01 a.m. on Apr. 1, 1940. Thus, persons who died after 12:01 a.m. 
should be enumerated; and infants born after 12:01 a.m. on Apr. 1, 1940, 
should not be enumerated.

302. Usual Place of Residence. Enumerate every person at his 
“usual place of residence.” This means, usually, the place that he 
would name in reply to the question “Where do you live?” or the place
that he regards as his home. As a rule, it will be the place where the 
person usually sleeps. 


Persons to Be Enumerated in Your District 


304. Enumerate all men, women, and children (including infants)
whose usual place of residence is in your district or who, if temporarily in 
your district, have no usual place of residence elsewhere. Persons who 
move into your district after Apr. 1, 1940, for permanent residence 
should be enumerated by you, unless you find that they have already
been enumerated in the district from which they came. 

305. Residents Absent at Time of Enumeration. Some persons having 
their usual place of residence in your district may be temporarily 
absent from the household at the time of the enumeration. These you 
must enumerate with the other members of the household, obtaining 
the information regarding them from their families, relatives, acquaint- 
ances, or other persons able to give it. However, do not include with 


1 The New York Times, Feb. 27-29, Mar. 1-3, 1940.
2 Bureau of the Census, Instructions to Enumerators, Population and
Agriculture, pp. 14-18, 80-81.

in a university or professional school there will be a considerable number
of the older students who are not members of any household located
elsewhere. Find and enumerate all such persons. 

319. Schoolteachers. Enumerate teachers in a school or college at 
the place where they live while engaged in teaching, even though they 
may spend the summer vacation at their parents’ home or elsewhere. 

320. Student Nurses. Enumerate student nurses as residents of the 
hospital, nurses’ home, or other place in which they live while they are 
receiving their training. 

321. Patients in Hospitals, Sanitariums, and Convalescent Homes. 
Most patients in hospitals, sanitariums, and convalescent homes are 
there temporarily and have some other usual place of residence. Enu- 
merate patients as residents of such an institution only if they have no 
other place of residence from which they will be reported. A list of 
persons having no permanent homes can usually be obtained from the 
institution records. 

322. Inmates of Prisons, Asylums, and Institutions Other than Hospitals.
Your district may include a prison, reformatory, or jail, a home
for orphans, for aged, infirm, or needy persons, for blind, deaf, or incura- 
ble persons, a soldiers’ home, an asylum or hospital for the insane or the 
feeble-minded, or a similar institution in which the inmates usually 
remain for long periods of time. Enumerate all the inmates of such 
institutions at the institutions. Note that in the case of jails you must 
enumerate the prisoners there, however short the sentence. 


CENSUS OF AGRICULTURE 
General Information 


Purpose of the Census of Agriculture. An act of Congress provides 
that a census of agriculture be taken every 5 years, for the purpose of 
obtaining basic information on farm acreage, land values, crops, live- 
stock, and other general items relating to agriculture. The Sixteenth 
Census, which will be taken as of Apr. 1, 1940, will include compre- 
hensive information on agriculture, including irrigation and drainage of 
farm land. 

Every enumerator must fill out a farm and ranch schedule for each 
tract of land in his enumeration district that might classify as a “farm” 
under the census classification, giving all the requested information. 
The information should be obtained by a personal visit of the enumerator.
It is absolutely necessary that the census be complete and accurate.
Census data are widely used by both private and public agencies
and often form the basis for legislative and administrative programs.
The farmer should be made to feel that his contribution to the census
is of real value to himself and to his community. 

Census Schedules Are Confidential. The Federal law providing for the 
census prescribes heavy penalties for revealing information to unauthorized
persons. The enumerator should make it clear, in dealing with
persons who seem unwilling to give the information requested, that he 
is not allowed to give any information to their neighbors or other 
persons; that only sworn census employees will have access to the farm 
schedules; and that those records for individual farms cannot be used 
for purposes of taxation, regulation, or investigation.


Definition of a Farm. The definition of a farm found on the face of 
the schedule must be carefully studied by the enumerator. Note that 
for tracts of land of 3 acres or more the $250 limitation for value of 
agricultural products does not apply. Such tracts, however, must have 
had some agricultural operations performed in 1939 or contemplated in 
1940. A schedule must be prepared for each farm, ranch, or other 
establishment that meets the requirements set up in the definition. A 
schedule must be filled out for all tracts of land on which some agri- 
cultural operations were performed in 1939 or are contemplated in 1940 
and which might possibly meet the minimum requirement of a “farm.”
When in doubt, always make out a schedule. 


You now have instructions that will help enumerate the inter- 
esting family first encountered above—the mother will be enu- 
merated (paragraph 321), the baby will not be enumerated 
(paragraph 301), the son will be enumerated (paragraph 318), the 
daughter will not be enumerated (paragraph 319), and the father 
will not be enumerated at the household, although if the jail is in 
town he will there be enumerated (paragraph 322). 
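
The reasoning applied to this family can be sketched as a small set of rules. The Python below is only an illustration under stated assumptions: the `Person` record, the `situation` labels, and the function name are hypothetical, the adults' birth date is a placeholder, and the rule for the son follows the text's reading of paragraph 318, which is not quoted in full above.

```python
from dataclasses import dataclass
from datetime import datetime

# Census moment from paragraph 301: 12:01 a.m. on Apr. 1, 1940.
CENSUS_MOMENT = datetime(1940, 4, 1, 0, 1)


@dataclass
class Person:
    name: str
    born: datetime
    situation: str  # hypothetical labels, see SITUATION_RULES


# True: list the person with the household. False: the person is listed elsewhere.
SITUATION_RULES = {
    "at_home": True,
    "hospital_patient": True,  # paragraph 321: the household is the usual residence
    "student_away": True,      # paragraph 318, as the text applies it
    "teacher_away": False,     # paragraph 319: counted where she teaches
    "in_jail": False,          # paragraph 322: counted at the jail
}


def enumerate_with_household(person: Person) -> bool:
    """Decide whether the enumerator lists this person with the household."""
    if person.born > CENSUS_MOMENT:  # paragraph 301: born after the census moment
        return False
    return SITUATION_RULES[person.situation]


long_ago = datetime(1900, 1, 1)  # placeholder birth date for everyone but the baby
family = [
    Person("mother", long_ago, "hospital_patient"),
    Person("baby", datetime(1940, 4, 1, 10, 0), "hospital_patient"),
    Person("son", long_ago, "student_away"),
    Person("daughter", long_ago, "teacher_away"),
    Person("father", long_ago, "in_jail"),
]
decisions = {p.name: enumerate_with_household(p) for p in family}
```

The point of writing the rules as a table is the one the text makes: every situation the enumerator may meet has to be named in advance, and a situation missing from the table produces an error rather than a guess.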

Figures 5 and 6 are photographic reproductions of parts of the 
Farm and Ranch Schedule used for the census of agriculture. On 
Fig. 6 appears the definition of a farm to which reference is made 
in the general information quoted above from the manual of 
instructions. Altogether the farm and ranch schedule contains 
232 questions on 16 subjects. The subjects include information 
about the operator, farm acreage, values, farm mortgage and 
taxes, irrigation, cooperative selling and purchasing, farm labor, 
farm expenditures in 1939, farm machinery and facilities, livestock
and livestock products, crops harvested on the farm in 1939,
and value of products used and of forest products sold in 1939.

Only trained enumerators can successfully use such elaborate
questionnaires as those illustrated; only when properly instructed
can enumerators know how to get the information requested in 
each question of these complex schedules. In some cases, the 
questionnaires, or schedules of questions, are tried out by a 
person-to-person call at the sources of information in advance of
collecting the data for the final enumeration. There was during 
the summer of 1939 a trial census, covering a sample area in 
Indiana, taken by the United States Bureau of the Census while 
formulating the new and more complicated 1940 census schedule. 

Statistics Obtained from Samples. Every twentieth person 
enumerated on the 1940 census was asked supplementary ques- 
tions. The results constitute a 5 per cent sample. For the 
sample of population, the following subjects were covered: the 
usual occupation, industry, and worker class as a supplement to 
information obtained concerning present occupation, in order to 
determine the availability of and shifts to various kinds of labor; 
whether the respondent has a Federal social-security account 
number and whether wage deductions have been made for Federal 
old-age insurance during the 12 months ending Dec. 31, 1939; 
data showing the number of children born to women who are or 
have been married (women married, widowed, or divorced), to 
make studies of differential fertility; mother tongue, or native 
language, obtained by a question asking what language was 
spoken in the home in earliest childhood; the status of veterans of 
foreign wars and their wives, widows, and children; and informa- 
tion concerning the place of birth of the father and the mother of 
the respondents. 
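
Taking every twentieth person in enumeration order is a systematic sample, and the arithmetic behind the 5 per cent figure can be checked with a short sketch. The list of 1,000 numbered persons and the function name `every_kth` are assumptions made for illustration only; they are not taken from the census procedure.

```python
def every_kth(records, k=20):
    """Return every k-th record in enumeration order (the k-th, 2k-th, ...)."""
    return records[k - 1::k]


persons = list(range(1, 1001))  # hypothetical enumeration order of 1,000 persons
sample = every_kth(persons)     # persons numbered 20, 40, 60, ...
fraction = len(sample) / len(persons)
```

With a fixed interval of 20 the sample fraction is 1/20, or 5 per cent, whatever the size of the population, provided the population is large compared with the interval.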

This is the first decennial census in which the sampling process 
has been applied, and the results of the experiment are eagerly 
awaited by statisticians everywhere. While the decennial census 
has always been presumably a complete enumeration, other gov- 
ernmental statistics have frequently been drawn from samples. 
Indeed, because of limited funds, it is necessary for the Bureau of 
Labor Statistics to resort to the sampling method to obtain data 
on wages and hours in industry. Preliminary to the collection of 
such data the census data for the industry are studied to determine
in which states the industry is of material importance.
Manufacturers’ directories are examined and books and periodicals
relating to the industry are read, thus obtaining the essential
a priori background of knowledge to form the basis for sound
proportional sampling.¹
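
One way to picture proportional sampling is as an allocation of a fixed number of interviews among states in proportion to each state's share of the industry. The text does not say how the Bureau of Labor Statistics performs this step, so the sketch below is an assumption throughout: the state labels and worker counts are invented, and the largest-remainder rounding is simply one convenient way to make the whole-number quotas add up to the intended total.

```python
def proportional_allocation(sizes, n):
    """Split a sample of n units among groups in proportion to their sizes."""
    total = sum(sizes.values())
    raw = {group: n * size / total for group, size in sizes.items()}
    alloc = {group: int(share) for group, share in raw.items()}  # round down first
    leftover = n - sum(alloc.values())
    # Hand the remaining units to the groups with the largest fractional parts.
    by_remainder = sorted(raw, key=lambda g: raw[g] - alloc[g], reverse=True)
    for group in by_remainder[:leftover]:
        alloc[group] += 1
    return alloc


# Hypothetical numbers of workers in one industry, by state.
workers_by_state = {"State A": 5000, "State B": 3000, "State C": 2000}
quotas = proportional_allocation(workers_by_state, 100)
```

Here the three states receive 50, 30, and 20 of the 100 interviews, matching their shares of the industry's workers.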

To obtain the data on wages and hours of labor, the Bureau of 
Labor Statistics uses carefully prepared and elaborate question- 
naires, one of which is illustrated in Figs. 7 and 8. Trained 
agents obtain the information from a responsible official of each 
local union. Each scale of wages and hours is verified by the 
union official interviewed and is further checked by comparison 
with the written agreements when copies are available. For 
example, in the building-trades survey for June 1, 1939, inter- 
views were obtained with 1,551 union representatives, and 2,729 
quotations of scales were received. The union membership 
covered by these contractual scales of wages and hours was 
approximately 444,000. Great care is exercised to see that the 
agents are adequately trained to collect the data; written instruc- 
tions are supplied them by the Bureau of Labor Statistics, 
in which they are cautioned as follows: 


In the final analysis the accuracy and value of the entire survey must 
rest upon the agents who collect the data. These data must be absolutely
correct and so presented on the schedule as not to be confusing
or ambiguous. Each agent is, therefore, requested to study thoroughly
the instructions, not once but repeatedly, and to question any point
therein which may not be perfectly clear. It is extremely important 
that the agent check every schedule carefully before mailing to the office 
to be sure that each item is correctly entered and explained. When 
agreements accompany the schedules, the agent must compare each 
quotation with the provisions in the agreement and must explain any 
differences. 


In order to ensure the collection of comparable data from all
agents, the instructions give painstaking definitions of “union
scale,” “collective agreement,” “apprentices” and “foremen,”
“union rates” and “actual rates,” “union rates” and “prevailing
rates,” and “averages.”

1 For further details of the methods employed by the Bureau of Labor
Statistics, see “Methods of Procuring and Computing Statistical Information
of the Bureau of Labor Statistics (1923),” Bulletin 326; also “Union
Scale of Wages and Hours in the Building Trades, June 1, 1939,” Serial
R1034, from Monthly Labor Review, November, 1939.

2 Bureau of Labor Statistics, “Instructions for Survey of Union Scales of
Wages and Hours, 1939” (No. 7468), p. 1.

12. Remarks or amplification of data reported on front of schedule ....

13. Have there been any changes in rates since June 1, 1939, or have any
changes been agreed upon to become effective in the near future?
Date effective / Occupation / Rate

14. Agreements are negotiated with:
(a) Individual employers ... (b) Employers’ association(s) ...
If (b), what proportion of union firms participate or are represented in
the association(s)? ...

For Building Trades Only

15. Does union cooperate with employer in establishing or enforcing
standards of fair competition? Yes [ ] No [ ] If there is a written code of
fair competition please attach a copy. If oral or customary arrangements,
please explain.

16. What proportion of the 1- and 2-family dwellings in this community are
being built under union conditions so far as this trade is concerned? ....
Does the union have a special rate for this type of work? Yes [ ] No [ ]
Explain

Fig. 8. Second portion of questionnaire used by the Bureau of Labor Statistics
to obtain data on union wages and hours.

Study of Family Income and Expenditures. In 1929 the Social
Science Research Council suggested the advantages of conducting 
a study of consumption in such a way that the sample would cover 
a wide range of incomes, all types of natural families, and all 
occupations within representative communities of different 
sizes. Income data and certain other facts would be collected 
from all families visited, through the use of a short schedule. 
These data would provide the basis for selection of an adequate 
number of families in each income class to furnish more careful 
estimates of income and the details of expenditures. Following 
these suggestions, the National Resources Committee and the 
Bureau of Home Economics of the United States Department of 
Agriculture completed in 1939 a study of family income and 
expenditures. Figure 9 shows the questionnaire used.¹ Tables
of data based upon this questionnaire are shown in Chap. IV. 
It may be noticed that the type of question and indeed the whole 
schedule are much less complex, involving much simpler units, 
than any thus far illustrated. It was necessary for this schedule 
to be simpler than those discussed above because for the con- 
sumer-income study the agents were not so well trained as are, for 
example, the regularly employed field agents of the Bureau of 
Labor Statistics. 

Mailed Questionnaires. In some cases, especially where the 
schedule of questions is comparatively simple, questionnaires 
are sent through the mail to the sources of information. Such a 
method may be used either where the units involved are very 
simple or where those who are filling out the questionnaires are 
known to be qualified to do so. The United States Bureau of the 
Census and the Bureau of Labor Statistics have been able to 
use this method to obtain certain types of information from 
manufacturing concerns regarding employment, pay rolls, manu- 
facturing output, labor turnover, and the like. The method 
appears to be most used where fairly simple facts are collected at 
regular intervals. Data on pay rolls and employment are 


1 Bureau of Home Economics, U.S. Department of Agriculture, “Consumer
Purchases Study,” Part I, Family Income, Miscellaneous Publication
339, pp. 338-339; cf. National Resources Committee, Consumers’ Incomes in
the United States (1938), p. 49.

obtained by mailed questionnaires monthly by the Bureau of
Labor Statistics from representative manufacturing establishments
in 90 manufacturing industries.¹ Figure 10 is an illustration
of the type of letter used by such agencies to secure the good
will and cooperation of businessmen.²

Where the questionnaire-by-mail method is used, the returns 
must be carefully edited and subsequent correspondence is 
frequently required to correct mistakes made on the returns. 
Manufacturing and merchandising concerns in this country have 
become trained in the matter of filling out questionnaires for the 
government through years of practice so that there has been built 
up a cooperative enterprise between the government and business 
in the gathering of business statistics. Although sometimes feel- 
ing the heavy burden of filling out numerous forms of this type, 
business is nevertheless glad to cooperate because it is eager to 
see each month the compilation of business data that emanates 
from government sources. 

Income-tax returns are of the nature of questionnaires and 
are a source of many important statistics. Everyone is familiar
with the care necessary in the examination of the units involved; 
everyone who has had to handle a return or listen to the head of 
the family talk about it knows how detailed and specific are the
printed instructions accompanying each form on which the return 
is made. In the case of the income-tax return, which frequently 
becomes so complicated as to require legal advice and expert 
accountants, the penalty for failure to file a return is sufficient 
to supply any incentive needed to overcome all obstacles. For 
failure to supply information for the other types of questionnaire 
that have been discussed, with the exception of the census, there 
is no similar penalty—the business concerns fill out such ques- 
tionnaires in a spirit of public service and to obtain the resulting 
compilations of data.

Rules for Constructing Questionnaires. Any investigator who 
is tempted to seek information by the questionnaire method 
will be well advised to spend considerable effort, first to make
certain that the facts are not already available, and then to 


1 Bureau of Labor Statistics, “Employment and Pay Rolls,” Serial R1052,
November, 1939, pp. 7, 11, and 16.
2 This letter was used in January, 1940, with a new questionnaire revised 
to obtain better monthly data on labor turnover. 
U.S. DEPARTMENT OF LABOR
BUREAU OF LABOR STATISTICS 
WASHINGTON. 


[...] most important re[...] the separation of accessions into two groups:
the first, to show the number of workers rehired after a separation of
three months or less; the second, to include all other workers hired.
The purpose of this breakdown, of course, is to determine whether or
not accessions occur in connection with new job opportunities or
whether they are the result of temporary suspensions.


We have also provided space to report separately changes in
clerical, sales and supervisory personnel, so as to permit tabulations
for the turn-over of these employees. If it is difficult or impossible
for you to report this information separately, we shall appreciate the
data either for the total of all employees or, if this too is not
possible, for plant employees only.


We sincerely hope that our labor turn-over reports based on 
this new procedure will be more useful and valuable to you, and we 
shall greatly appreciate your continued cooperation.


Isador Lubin, 
Commissioner of Labor Statistics 


Fig. 10. A typical letter from the Bureau of Labor Statistics seeking to secure
the good will and cooperation of businessmen in the reporting of statistics. 
investigate well the pitfalls of questionnaire making, which is
a highly specialized art. There are six fundamental but simple 
rules to be followed: 

1. The interest of the recipients of the questionnaires must be 
aroused or their cooperation obtained through some means. 
This may be done by engaging the support of some organization 
with which the individual informants are associated. For exam- 
ple, if the questionnaire is to go to bankers, the support or endorse- 
ment of the American Bankers Association should be enlisted. 
Interest in the questionnaire may also be aroused by the promise 
to furnish free copies of the summarized information when compiled.
In this manner and by the promise of secrecy regarding
individual returns, various governmental units obtain great 
quantities of statistical information. 

2. The questionnaire should be as short as possible, consistent
with the scope of information sought; and the individual ques- 
tions should be so formulated as to be free of all ambiguity. 
They should be simple. Avoid presenting “problems” that will 
puzzle the recipients of questionnaires.

3. Where possible, arrange the individual questions so that 
replies can be brief and unequivocal. “Yes” or “no” or perhaps
merely a check mark is the ideal answer. 

4. The letter transmitting the questionnaire should be 
brief and dignified and yet should “sell” the idea to the 
informants. 

5. After all is prepared, try out the questionnaire along with 
the transmittal letter on a dozen or so of the potential question- 
naire recipients in order to make final revisions before printing 
the questionnaires, or schedules.

6. Always include a self-addressed stamped return envelope. 

The first five rules apply whether the questionnaire is to be 
used by trained enumerators or to be sent by mail, but special 
care must be exercised if sent by mail. Study of Fig. 9 will 
reveal that answers to all questions are quite simple, in some 
cases merely a check mark (see questions VI, 1, 2, 4, and VII),
in other cases the entry of a familiar numerical item. Less 
highly trained enumerators are required for handling such a 
questionnaire than are required for handling the United States 
census schedules.
EDITING


When the questionnaire is received from the agent or from the 
respondent by mail, it must be examined. If any statement on 
the schedule conflicts with other statements or if the schedule 
is incomplete or lacks clearness, it may have to be returned to 
the agent or respondent for explanation or revision. This is 
called “editing” the returns or the schedules. In any case, a
certain amount of editing must always be done before tabulation 
of the data is begun. When trained visiting enumerators have 
been used in the survey, there will, of course, be a minimum 
of mistakes. When the questionnaires have been filled out by 
the informant directly, it may be necessary to write for further 
information or for corrections because of inadvertent mistakes 
in replies. If the respondents have been interested sufficiently 
to return the questionnaire with answers filled in, they will 
probably be willing to answer further simple questions to eluci- 
date their former replies. If it is believed that the information 
has been deliberately falsified or withheld, it may be necessary
to discard the entire schedule or at least the replies in it that 
seem to be of doubtful truth. 

Editing the schedules is the process of preparing the original 
statements in the schedule for classification, coding, and tabula- 
tion. Careful editing is necessary in order to obtain compilations 
of data that will truly reflect the conditions being investigated. 
One task of editing is to see that all figures entered on the return 
are clear. If not, the editor rewrites the figures. If so poorly 
written that even the editor cannot read them, the schedule 
must be abandoned or the information obtained by further 
correspondence. If the editing is done locally, many of these 
difficulties may be eliminated by telephoning. 

The principal task in editing is to locate all incomplete, incon- 
sistent, or improbable and impossible answers. When such 
answers are found, it is necessary either to discard the defective 
schedules or to obtain correct replies through further inquiry. 
This does not, of course, imply the elimination of “unexpected” 
replies. An answer is incomplete, for example, if pneumonia is given
as the cause of death without stating whether it is bronchial or
lobar pneumonia. An answer is inconsistent if a return shows a person
widowed when from his age it is clear that he never could have been
married. If a male is reported as having died of a disease that is
known to occur only in females, the answer is impossible. The line
between improbable and simply unexpected replies is somewhat less
distinct.
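
The editing checks just described can be sketched in modern terms as a small program. The following Python sketch is an illustration only and is not taken from any census manual: the field names (`age`, `sex`, `marital_status`, `cause_of_death`), the minimum age of 14 for a widowed person, and the list of causes occurring only in females are all assumptions made for the example.

```python
# Illustrative sketch of schedule editing. All field names, thresholds, and
# cause lists below are hypothetical assumptions, not official rules.

MIN_AGE_WIDOWED = 14                      # assumed threshold for illustration
FEMALE_ONLY_CAUSES = {"puerperal fever"}  # assumed example list
CAUSES_NEEDING_DETAIL = {"pneumonia"}     # must be given as bronchial or lobar


def edit_schedule(record):
    """Return a list of (kind, message) problems found in one schedule."""
    problems = []

    # Incomplete: a required entry is missing or too vague to classify.
    for field in ("age", "sex", "marital_status"):
        if record.get(field) in (None, ""):
            problems.append(("incomplete", f"{field} is missing"))
    cause = record.get("cause_of_death")
    if cause in CAUSES_NEEDING_DETAIL:
        problems.append(("incomplete", f"{cause}: type not stated"))

    # Inconsistent: two entries on the same schedule contradict each other.
    age = record.get("age")
    if (record.get("marital_status") == "widowed"
            and isinstance(age, int) and age < MIN_AGE_WIDOWED):
        problems.append(("inconsistent", "widowed at an age too young to marry"))

    # Impossible: the entry cannot be true for this person.
    if record.get("sex") == "male" and cause in FEMALE_ONLY_CAUSES:
        problems.append(("impossible", f"male reported dying of {cause}"))

    return problems


schedules = [
    {"age": 8, "sex": "male", "marital_status": "widowed"},
    {"age": 40, "sex": "female", "marital_status": "married",
     "cause_of_death": "pneumonia"},
    {"age": 62, "sex": "male", "marital_status": "married"},
]
for number, schedule in enumerate(schedules, start=1):
    print(number, edit_schedule(schedule))
```

A schedule that yields an empty list is ready for coding; the others are returned for further inquiry or discarded. Replies that are merely unexpected are deliberately not flagged, since the sketch tests only for the three kinds of defect named above.
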
Only after incomplete, inconsistent, or improbable and impos- 
sible replies have been completed or corrected and all unclear 
figures carefully clarified are the schedules ready for coding, 
classification, and tabulation. For elaborate undertakings like 
the census, instructions are printed not only for the guidance 
of the enumerators but also for the editing and coding of the 
returns. For example, it is pointed out that the examination 
for completeness and consistency should be made family by 
family and not line by line. It will be easier to follow the entries 
belonging to the family if a strip of cardboard is placed across 
the schedule just under the line containing the entries for the 
last member of the family. The coding and editing instructions 
say that all corrections and code figures entered on the schedule 
by the coding clerks should be made with red ink and a medium- 
point pen (neither a stub nor an extremely fine pen). Such a 
detailed instruction as this is necessary in order to secure uniformity
and, when tabulation is undertaken, will enormously facilitate the
work of the card-punching operators.


CODING 


Whether or not machine tabulation is used, the coding of the 
schedules is a measure for economizing time. When large 
amounts of data are involved, consistent classification is enormously
simplified by the use of code numbers. In arranging
data it is then necessary only to observe a code number con- 
spicuously and uniformly placed on the return instead of reading 
a title and remembering to what class that title belongs. On 
a Works Progress Administration project to construct indexes of 
manufacturing employment and pay rolls in the state of New 
Jersey, 1923-1940, it was not possible to obtain the use of tabu- 
lating machines. It was found necessary, nevertheless, to use a 
carefully worked out coding procedure to avoid hopeless con- 
fusion in the handling of the data, which came monthly from 


1 Cf. United States Bureau of the Census, Instruction Manuals on Coding,
passim.
several hundred reporting firms. When machine tabulation is
used, the coding procedure is a necessary step; it will be noticed
that on the schedules (see Figs. 1 to 8) columns are inserted for
the code numbers or letters to represent the various types of 
information on the schedule. 

An Illustration of Coding. In the 1939 census of manufac- 
tures, the manufacturing industries in the United States were 
grouped into 20 groups, each with a number. Food and kindred 
products constitute group 1; its code number is 100. Lumber 
and timber basic products form group 5; its code number is 500. 
Chemicals and allied products are group 9; its code number is 900. 
All subgroups of industries in the food and kindred products 
classification have code numbers in the 100’s; for example,
beverages are numbered in the 180’s: nonalcoholic beverages
is 181, malt liquors 182, wines 184, and so on. Grain-mill
products are numbered in the 140’s: flour and other grain-mill
products is 141, cereal preparations 143, rice cleaning and
polishing 144, and so on. Confectionery and related products
are numbered in the 170’s: chocolate and cocoa products is 172,
chewing gum 173, and so on. Similarly, subgroups of industries
in the chemicals and allied products classification have code
numbers in the 900’s; for example, industrial-chemical industries
are numbered in the 980’s: plastic materials is 982, explosives
983, coal-tar products, crude and intermediate, 981, and so on.1
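
The arithmetic of this code scheme can be shown in a few lines. The Python sketch below assumes, as the examples in the text suggest, that the hundreds digit of a three-digit code identifies the major group and the tens digit the subgroup. The function name `classify` and the lookup tables are illustrative; the tables hold only the codes quoted above, not the full census list.

```python
# Sketch of the three-digit industry code described in the text.
# The tables hold only the examples quoted above; they are not the full list.

MAJOR_GROUPS = {
    100: "Food and kindred products",
    500: "Lumber and timber basic products",
    900: "Chemicals and allied products",
}
SUBGROUPS = {
    140: "Grain-mill products",
    170: "Confectionery and related products",
    180: "Beverages",
    980: "Industrial chemicals",
}
INDUSTRIES = {
    141: "Flour and other grain-mill products",
    143: "Cereal preparations",
    144: "Rice cleaning and polishing",
    172: "Chocolate and cocoa products",
    173: "Chewing gum",
    181: "Nonalcoholic beverages",
    182: "Malt liquors",
    184: "Wines",
    981: "Coal-tar products, crude and intermediate",
    982: "Plastic materials",
    983: "Explosives",
}


def classify(code):
    """Split an industry code into (major group, subgroup, industry) names."""
    major = (code // 100) * 100   # 182 -> 100
    sub = (code // 10) * 10       # 182 -> 180
    return (
        MAJOR_GROUPS.get(major, "unknown"),
        SUBGROUPS.get(sub, "unknown"),
        INDUSTRIES.get(code, "unknown"),
    )


for code in (182, 143, 983):
    print(code, classify(code))
```

Because the group and subgroup are recovered from the number itself, a clerk or a machine can sort returns by major group without reading the industry title at all, which is the economy of time that coding is meant to secure.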

The classifications adopted by the United States Bureau of the 
Census for the 1939 census of manufactures follow closely the 
suggestions made by the Technical Subcommittee on Industrial 
Classification composed of representatives of various government
agencies.2 The suggested classification of this subcommittee,
designated the Standard Industrial Classification Code,
was made according to the following principles:3

1. The classification should conform to the existing structure 
of American industry. 


1 United States Bureau of the Census, “Industry Classifications for the 
Census of Manufactures, 1939,” Form 75. 

2 Members of the subcommittee included representatives of the Department
of Labor and Industry of New York State, the Federal Social Security
Board, the Bureau of Internal Revenue, the Bureau of Labor Statistics, the 
Bureau of the Census, the United States Employment Service, and the 
Central Statistical Board. 

3 Central Statistical Board, May 10, 1938. 
2. The reporting units to be classified are establishments.
(An establishment is defined as a place of business. All persons 
working at the same location or place of business are classified 
in the same industry.) 

3. Each establishment is to be classified according to its 
major activity. 

4. Each industry group established must have significance 
from the standpoint of the number of establishments and 
employees involved, volume of business, employment and pay- 
roll fluctuations, and other important economic features. 


TABULATION 


When the schedules have been edited and coded they are 
ready for the operations of the card-punch machines, and the 
final machine tabulations are made from these punched cards. 
The information on each schedule is transferred in code to the 
punch cards. With a machine resembling a toy typewriter, 
operators punch holes or combinations of holes in the cards so 
that the electrically operated machinery for sorting and tabulating 
can automatically transfer the information to totals by any 
classification desired. The punch card somewhat resembles 
the music roll of an old-time player piano, and most of the 
operations through which it goes are mechanical and electrical. 
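
What the sorting and tabulating machines do with the punched cards can be imitated in a few lines of Python. This is a sketch under assumed data: each "card" is a tuple of coded fields (industry code, state, employees), and both the field layout and the figures are invented for illustration.

```python
from collections import defaultdict

# Each tuple stands for one punched card: (industry_code, state, employees).
# The layout and the figures are invented for illustration.
cards = [
    (182, "NJ", 120),
    (143, "NJ", 45),
    (182, "NY", 300),
    (983, "NJ", 80),
    (143, "NY", 60),
]


def tabulate(cards, key_column, value_column):
    """Total one column of the cards, grouped by another column."""
    totals = defaultdict(int)
    for card in cards:
        totals[card[key_column]] += card[value_column]
    return dict(sorted(totals.items()))


# The same deck of cards yields totals by any classification desired.
print(tabulate(cards, key_column=0, value_column=2))  # by industry code
print(tabulate(cards, key_column=1, value_column=2))  # by state
```

The point of the sketch is that the deck is punched once and may then be totaled by any coded column, which corresponds to re-sorting the same cards for each new classification.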

The 1930 census required the punching of 326,635,219 cards, 
which required an additional handling for verification. These
cards represented 2,000,000 pounds of paper and would make a 
belt reaching nearly twice around the world at the equator. 
Punching, tabulating, and related work were equivalent to the 
handling of 4,701,671,697 cards once. 
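
The two card counts quoted above imply how many times, on the average, each card passed through a machine. The short Python calculation below only divides one published figure by the other; the variable names are chosen for the example.

```python
# Figures quoted in the text for the 1930 census.
cards_punched = 326_635_219
equivalent_single_handlings = 4_701_671_697

# Average number of machine handlings per card.
handlings_per_card = equivalent_single_handlings / cards_punched
print(round(handlings_per_card, 1))
```

On these figures each card was handled, on the average, somewhat more than fourteen times in punching, verification, sorting, and tabulation.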

The Bureau of the Census has its own unit tabulating equip- 
ment. Some of these machines can digest 400 cards a minute. 
The unit machines were invented and developed within the 
Bureau by Herman Hollerith, who was employed in the Bureau 
and invented the first machine to tabulate the 1890 census. 
He is now known as the “father of machine tabulation,” a method used
throughout the world by governments and business to handle
large statistical jobs.


CHAPTER III
SOURCES OF STATISTICS 


Primary and Secondary Sources. The original collector of 
data is their primary source. Generally speaking, data obtained 
from a primary source inspire greater confidence than the same 
data taken from a secondary source. The primary source is 
presumably the one sure place to find the exact definition of the 
units of observation involved. Subsequent reproductions of
the data may fail to reproduce this essential information and 
lead to a misunderstanding of the true meaning of the data. 

The United States Bureau of the Census is the primary source 
of population data, of census data in general, and of all the 
statistical data published by the United States Department of 
Commerce, for the Bureau of the Census is the data-gathering 
agency of the Department. The Bureau of Foreign and Domestic 
Commerce, on the other hand, is a large retailer of statistical 
data gathered not only from the records of the Bureau of the 
Census but also from numerous other governmental and non- 
governmental sources. While governmental publications are 
thus not uniformly primary sources, they are usually very 
careful to give exact reference to the primary sources and to 
define units adequately. 

In some cases, secondary sources may be better than the 
primary. Such is the case when experts presumably better 
qualified than the general run of statistical researchers have 
selected the good statistics from the poor ones in some primary 
source that may be either obscure and difficult to obtain or of a 
highly technical nature. Occasionally a secondary source
performs the valuable function of selecting data impartially 
from primary sources that are biased in one way or another. 
Sometimes it is necessary also to be on guard against bias in 
government sources.1

1 Hinrichs, A. F., “Statistical Bias in Primary Data and Public Policy,”
Journal of the American Statistical Association, Vol. 33 (1938), pp. 143-152.
Natural Sciences. After the development of the statistical
theories of gases (Charles’s law, Boyle’s law, Avogadro’s law,
the work of Gay-Lussac, and the like) the physical sciences and
arts accumulated source materials of a statistical character.
Beginning with the last quarter of the nineteenth century, 
biology and zoology also accumulated source materials of a 
statistical character when a group of English biologists concluded 
that mass observation was necessary for the successful solution 
of their problems.1

Nongovernmental Sources. Statistical data of the natural 
sciences consist to a large extent of hypothetico-observational 
or experimental data. The principal sources of these data are 
handbooks of the special fields of study and monographs written 
by scholars at the great centers of research. For example, 
sources of astronomical data are the observatories located in 
various places throughout the world. The sources of currently 
discovered data in biology, physics, and chemistry are the 
laboratories maintained by universities, by private business 
enterprisers, or by such institutions as the Smithsonian Institution
at Washington, D.C.

Additional primary sources of statistics in the natural sciences 
are the several hundred technical journals, publications of the 
learned societies, trade journals, publications of commercial 
research organizations, college bulletins, and the publications of 
endowed research enterprises. Fortunately for those who desire
to make use of them, the data currently accumulated in such 
sources are summarized or abstracted in publications that maintain
sections of their respective issues for the purpose.2

Statistical data for the natural sciences are also found in 
handbooks for the numerous special fields of study. For example, 
there are handbooks in medical entomology, physical therapy, 
geology, botany, experimental physics, and geophysics.3

1 ANDERSON, O. N., “Statistical Method,” Encyclopaedia of the Social
Sciences.

2 A partial list of such abstracting agencies is as follows: Science Abstracts,
Abstracts of Geology, Abstracts of Bacteriology, Abstracts of Chemical
Papers, Zentralblatt für Mathematik, Jahrbuch über die Fortschritte der
Mathematik, Physikalische Berichte, and Biometrika.

3 Handbook of Physical Therapy (1939); Handbuch der allgemeinen Chemie,
unter Mitwirkung vieler Fachleute (1918-1937); Handbuch der Experimentalphysik
(1926-1935), 43 vols.; Handbook for Chemistry and Physics.
Governmental Sources. Sources of data in the natural sciences
are enormously supplemented by governmental agencies. The
government weather bureau supplies current and historical
data important to many kinds of research in such natural sciences 
as botany, zoology, and geology. The Minerals Yearbook, 
published by the United States Department of the Interior, is a 
source of data for natural scientists. The Geological Survey
is a source not only of geological data but of data on electrical 
power production and other information useful to engineers. 
Engineers also find that government agencies are sources of 
statistics on railroads, flood control, roads, and other similar 
subjects having to do with construction.

Biologists find the chief source of modern vital statistics
of all sorts among the publications of governmental agencies. 
An important source of statistical data for medical men results 
from medical research recorded in the files of hospitals, some of 
which are governmentally operated. 

The quantity of statistical data relating directly to the natural
sciences is thus large, but the natural sciences in addition make
extensive use of the highly organized mass of statistical data
collected largely by social scientists. Scholars in the natural
sciences frequently make use of statistics concerning social and
economic events. It is not at all uncommon for data concerning 
the behavior of human beings to enter into the calculations of 
engineers, physicists, and chemists engaged in practical business 
enterprise or pure research. Some illustrations were given in 
Chap. I. 

Social Sciences. Genesis of Statistical Sources. The increasing
complexity of economic and social life has furnished the motive
for the systematic marshaling of statistical data about human
society; and, in addition, the dynamic quality of modern life
makes it necessary to repeat statistical enumeration frequently in 
order to have knowledge of current facts and, what may be 
more important, knowledge of change. In the static conditions 
of earlier times one public fact-gathering enterprise could serve 
for years as a basis for judgments and for political and social 
action. Under modern dynamic conditions, this is not the case. 

In a democracy the timing of governmental action is dependent 
on the consent of the people, and that requires widespread 
knowledge of many economic and social facts and their interpretation.
If democracy is to preserve its high standards of
achievement, its powers of expression in the face of tremendous 
forces that appeal to sentiment rather than to reasoned judgment, 
adequate factual information must be in the hands of the voters 
and of their governmental administrators and representatives 
in time for necessary action. Modern business enterprisers, 
too, faced with rapidly changing conditions, must lean more 
and more heavily on statistics to point the way to the solution 
of their problems.

During a great national crisis, such as a severe depression or a 
war, the value of statistical data is enormously enhanced. In 
depression periods, published statistical data from governmental 
sources, which in retrospect appear to have been but a trickle in 
prosperity, swell to flood proportions. Modern war, moreover, 
as well as being a “war of supply,” “a war of machines,” or a “war
of production,” is a “war of statistics.” The fact that much 
of the increasing wartime volume of statistical data is confidential 
explains the apparent and deceptive appearance of fewer sta- 
tistics in wartime than in peacetime. During the Second World 
War the statistics published by the United States Bureau of the 
Census, for example, sharply decreased because its organization 
and equipment were almost fully employed doing wartime sta- 
tistical work, especially for such agencies as the War Production 
Board and the Office of Price Administration.

So diligent have been the efforts to obtain current knowledge by 
means of statistics during the past fifty years that a vast source of 
raw material now exists, covering many fields of knowledge.
Elementary acquaintance with these sources is essential to all
those who hope to work in either the natural or the social sciences.
Complete familiarity with sources of statistics can come only with 
long practice in their use. It would be futile to attempt to impart 
to the student this desirable familiarity by giving a complete 
description of all sources.

The Pattern of Statistical Sources. The student cannot hope to
memorize the names of all sources of statistics; indeed, the 
attempt would not be useful, for the names change and new ones 
are added as time goes by. Comprehension of the pattern of 
development of statistical sources, however, will enable the stu- 
dent to become a scholar who, when confronted by a statistical 
problem, will have acquired a “statistical sense” that will guide
him to the appropriate sources. This presumption explains why
the present chapter on sources is given an historical or a genetic 
setting. Let the names of all the statistical agencies be changed 
by the Second World War, and the study of the historical and 
genetic explanation of statistical sources will still help the student 
acquire that scholarly ability required to locate sources; he will 
have historical perspective to facilitate his prompt understanding 
of the postwar world of statistical sources. In any case, the 
period between the First and the Second World War will long 
continue to be one intensively studied by statisticians of coming 
generations. 

In the ensuing description of sources of statistics, which is
presented in its historical or genetic aspects, governmental 
sources are given more space than nongovernmental sources, 
because the general statistician deals mostly with the former. 
While the specialized statistician must acquire detailed knowledge 
of sources in his special field, he also needs to be familiar with
governmental sources in his field. Governmental sources, more- 
over, are themselves one of the best guides to the successful use of 
nongovernmental sources, because many governmental agencies 
are secondary sources that give complete and very useful descrip- 
tions of the primary sources used. 

The motive underlying the gathering and publication of 
statistics by private enterprise has usually been the profit 
available through the sale of such statistical information to 
commercial, banking, and manufacturing or distributing enter- 
prises. In many instances these services emerged as incidental 
features of existing publications; an example is the increasing 
amount of statistics of all kinds published in newspapers and 
periodicals. In other instances the statistical feature was the 
original purpose of the publication; many trade journals are 
cases in point.

The state and privately endowed universities of the nation are 
important sources of statistical research, especially of a pioneering
character, in all branches of knowledge—some being famous for 
certain fields of statistical work. 

During recent years one of the most striking aspects of “big 
business” development has been the maintenance of research 
organizations contributing to statistical knowledge, a fact that the 
public was not permitted to forget as it visited the 1939 World’s 
Fair in New York and read newspaper and magazine wartime
advertisements in the early 1940’s. Some corporation-financed
research organizations, primarily intended for profit making, have
incidentally contributed in important ways to the advancement of
scientific statistics in engineering, business, and the use of agricultural
products. Most of the pioneering statistical research
in agriculture, however, as well as in labor organization, wages, 
and the like, is done by governmental units or by the govern- 
mentally sponsored agricultural experiment stations connected 
with various state colleges or universities. 

The motive underlying governmental activity in the collection 
and publication of statistics has been to increase knowledge of
facts so that administrators may adjust government action to the 
changing needs of a dynamic society, so that democratic repre- 
sentatives of the people may legislate more expeditiously and 
wisely, and so that the voters in a democracy may have the 
opportunity to know the facts. In recent years, a great expan- 
sion in the governmental activity of collecting and publishing 
statistics to aid business enterprise has occurred. In short, 
governmental statistical agencies assist both public and private 
economic planning. The large quantities of statistical informa- 
tion released by the Department of Labor and Department of 
Commerce are eagerly awaited by business enterprisers seeking to 
keep up to date in their methods, labor policies, coverage of 
potential markets, and knowledge of desirable sources of raw 
materials. True, their zeal in filling out the questionnaires that
constitute the sources of the desired statistics sometimes falters, 
but on the whole businessmen recognize the truly cooperative 
character of the system of collecting and disseminating business 
statistics and stoically endure the barrage. 

As a consequence of the manner of their historical and genetic 
origin, therefore, modern statistical sources in the United States 
fit into a pattern that is more or less uniform among the various 
fields of knowledge. This pattern is roughly as follows: 


Research of private enterprisers:

  Individual enterprisers. Special monographs, articles, and other
    contributions are made by individuals and published under the
    sponsorship of universities, professional publications, and the like.

  Research associations. Quantities of statistical data are collected by
    research organizations, some of which are hired by corporate or
    noncorporate “private enterprise” in the business world, some
    connected with universities, and some independently endowed.

Commercial sources, i.e., privately financed publications:
  These sources are in the business of collecting and publishing statistics
  as a profit-making enterprise; they include
    Trade journals.
    Commercial and financial periodicals and services.

Official publications of the government:
  Federal or national governmental agencies.
  Local, i.e., state or municipal, governmental agencies.


Guides to Sources of Statistics. If a trained professional 
librarian is available for consultation, he is the best informant on 
the subject of guides or handbooks to all general fields of research. 
However extensive may be the experience and training of the 
research scholar, he finds himself continually relying upon the 
local librarian, who makes a specialty of keeping posted on new 
developments with respect to handbooks and literary guides of all 
kinds. 

Guides to Nongovernmental Statistics. Practically every con- 
ceivable occurrence in the world of man or beast, in the heavens, 
on the ground, under the ground, on the sea, under the sea, or in
astronomical space holds an interest for some individual or group 
of individuals; either as a hobby or as a means of livelihood some 
individual or group of individuals is now and has for many years 
been collecting statistical facts about all these world events. The 
existing sources of statistics necessarily therefore appear at first 
glance to be an unwieldy mass; but, fortunately both for begin- 
ners and for practiced scholars, this mass has been for some time 
culled over and classified, indexed and cross-indexed by various 
types of handbooks, yearbooks, or guides of one sort or another. 

The general magazine indexes constitute one class of such 
guides; the principal ones are as follows: 


Agricultural Index. 

Education Index. 

Engineering Index. 

Industrial Arts Index. 

Public Affairs Information Service. 
Readers’ Guide to Periodical Literature. 


Such indexes or guides are compiled monthly and cumulated 
into annual volumes, and articles of a statistical character appearing
in a comprehensive variety of journals and trade magazines
can be discovered by the intelligent use of these alphabetically 
arranged indexes. The above-listed indexes are not specifically 
organized as guides to statistical sources; their collective purpose 
is as broad in scope as all modern knowledge, but one of their 
varied uses is to serve as guides to sources of statistics. 

Indexes or handbooks specifically dedicated to serve as guides 
to sources of statistics do, however, exist in considerable number.
In 1937 the United States Department of Commerce published
“Sources of Current Trade Statistics” (Market Research Series
13), which lists practically all current trade statistics by governmental
and nongovernmental agencies; this handbook was
designed for the use of manufacturers, distributors, financial 
institutions, advertising agencies, trade associations, bureaus of 
business research, and individuals engaged in research work. In 
1942 the United States Department of Commerce published a 
handbook entitled Trade and Professional Associations of the 
United States, which lists the sources of practically every conceivable
type of trade statistics compiled by nongovernmental agencies.

In 1934 a scholarly attempt was made by Gerlof Verwey and 
D. C. Renooy to construct a manual of statistical sources under 
the title The Economist's Handbook; this book was published in 
Amsterdam, Holland, and a supplement appeared in 1937. It is 
a guide to statistical sources on economic subjects, covering 
Belgium, France, Germany, the Netherlands, Switzerland, the 
United Kingdom, and the United States. In the United States, 
D. H. Davenport and F. V. Scott were authors in 1937 of An Index 
to Business Indexes, a book containing information about the 
many indexes used in business, including the name of the compiler, 
description of the index, frequency of publication, period covered, 
and the name of the publication in which current data appear.
In 1937 the Special Libraries Association published a handbook,
Guides to Business Facts and Figures, in which Part III is an index
of statistical sources of information.

A multiple assortment of handbooks in various special fields 
serve as guides to statistics in each special field of knowledge, 
along with other purposes for which the handbooks are issued. 
For example, Management Handbook, Fliteraft, and Handbook 
of Accountants serve, in their respective fields, as guides to 
statistical sources.

Often the purpose of a handbook or index of sources of statistics
is served by one of the numerous abstracts of statistical
data. The Statistical Abstract of the United States, published 
annually by the United States Department of Commerce, is 
itself a source of statistics, but it is also an index to sources 
because at the head of or in a footnote to each table of data it 
records the primary source from which the data are obtained. 
Similarly, the World Almanac, which for 58 years has been pub- 
lished by the New York World or the New York World-Telegram, is 
itself a source of statistics and also a guide to sources for the same 
reason. 

Guides to Governmental Statistics. Many of the handbooks 
serving as guides to statistical sources compiled by nongovernmental
agencies include also in their alphabetical indexes a large
range of governmental statistical sources as well; but there are a 
number of important handbooks specifically intended to serve as 
guides to the maze of governmental sources of statistics. The 
best-known and most comprehensive guide is the United States 
Government Manual, published by the government. In 1938 the 
Central Statistical Board (later the Division of Statistical Stand- 
ards of the Bureau of the Budget) published a Directory of Federal 
Statistical Agencies. The Central Statistical Board was organized 
in 1933 in order to find some means of coordinating the various 
types of Federal statistics.¹ The business of the Central Statistical
Board was to serve as an agency for the reorganization in
collection, tabulation, and use of Federal statistics. It was hoped
such an agency could help solve the problem of overlapping in 
statistical function, which caused unnecessary burdens upon 
respondents to questionnaires and which also resulted in ineffi- 
ciency in the utilization of statistical information. 

In response to a request by the President in a letter of May 16, 
1938, the Central Statistical Board made a report on the question 
as to whether or not it is possible to reduce the amount of duplica- 
tion in statistical reports. The board concluded that much 
could be done in the way of coordinating the gathering, tabula-


¹ In the task of perfecting Federal statistics the government has received
the advice of scientific professional associations. See American Statistical
Association and the Social Science Research Council, Government Statistics:
A Report of the Committee on Government Statistics and Information Services
(1937).

tion, and presentation of Federal statistics; by such coordination,
comparability in definition would bring about a great improve- 
ment in the efficiency of data collected. With reference to the 
reduction in the amount of duplication, however, the board 
concluded that a majority of the financial and other statistical 
reports and returns made by the public to the Federal government
are incidental to the administration of governmental
functions; the statistics are a by-product of either administrative
or control functions of the government. Consequently, the
board recommended that the Federal statistical and reporting 
services should remain largely decentralized so that they may be 
associated with the respective governmental functions to which 
most of them specifically relate; but that there is a continuing 
need for a statistical coordinating agency, with a specially 
trained staff and with broad powers.¹ One important result
of the coordinating functions of the Central Statistical Board 
was the publication of a directory of federal statistical agencies, 
which has already been mentioned. 

A general guide to government publications, Anne Morris 
Boyd's United States Government Publications (1941), serves
incidentally as a guide to governmental sources of statistics.
This book also gives an analytical picture of the character and
scope of government publications. The same may be said
regarding Laurence F. Schmeckebier's Government Publications 
and Their Use (1939). 


RESEARCH OF PRIVATE ENTERPRISERS 


Individual Enterprise. Pioneers. In spite of the fact that 
Domesday Book was an eleventh-century product and that even 
earlier examples of governmental collection of statistics can be
cited, it remains true both historically and currently that the
pioneer work of converting public records into statistics is nongovernmental.
The pioneers have been and are individuals.
The father of modern vital statistics is John Graunt, who in the 
seventeenth century made statistical investigations that served 
as the basis for founding life insurance. Another seventeenth- 
century scholar, Sir William Petty, was the outstanding pioneer 
in developing statistics for the social sciences. Both these 

¹ Report of the Central Statistical Board, 76th Congress, 1st Session,
House Document 27, Jan. 10, 1939.

men were associated with the early development of the Royal
Society of London, which was incorporated in 1662 and is the 
oldest of modern learned societies. 

Pioneering in the art as well as the science of statistics con- 
tinues in modern times to be highly individualistic. This is 
exemplified by the work of Karl Pearson in England¹ and in the
United States by such men as Wesley C. Mitchell and his works 
on index numbers and the business cycle, Warren Persons and 
his work on the statistical analysis of business statistics, and 
many others.² Individual contributions are commonly presented
in the publications of learned societies, such as the Journal of the 
Royal Statistical Society, the Journal of the Statistical Society of 
London (founded in 1834), and the Journal of the American Statistical
Association (founded in 1839). These and the publications
of other learned societies are indexed in the guides mentioned 
earlier in this chapter. 

Research Associations. During the 1920’s and 1930’s a num- 
ber of important research organizations in the field of economics 
and social institutions were organized. The Brookings Institution
in Washington, D.C., the Harvard Committee on Economic
Research, the National Industrial Conference Board, the National
Bureau of Economic Research, and the Cowles Commission
were among the most prominent. 

The Harvard Committee on Economic Research was organized
in 1919 to study business trends and cycles and to publish a 
scientific business forecaster; its work was launched under the 
leadership of Warren Persons. In addition to the forecasting 
service, this research organization publishes the Review of Eco- 
nomic Statistics (quarterly) and once or twice a year a summary 
of statistics called the Statistical Record.

The National Industrial Conference Board was organized by a 
group of comparatively public-spirited manufacturers to study 
the various problems of employer-employee relationships, leading 
them into special studies of real wages, income distribution, and 
general economic conditions. It publishes its studies in the
form of books appearing as they are written. In addition to 
the subjects mentioned above there have been National Indus- 
trial Conference Board books on cost of living, statistics of income 


¹ See Chaps. XIII and XIV.
² See Chaps. XIX and XX.

by states, and availability of bank credit. The National Industrial
Conference Board has also published since 1940 The Economic
Almanac, which is a widely used annual.

The National Bureau of Economic Research was founded in
1920, sponsored by a group who believed that a purely dis- 
interested approach is desirable and that no group should control 
the findings of this new statistical organization. It is so con- 
stituted as to produce this desirable result. A number of special 
studies of economic and social conditions have been made and 
published under its auspices and some in cooperation with the 
government. For several years it has occasionally issued 
bulletins containing data resulting from studies that usually 
appear later in more detail in book form. 

The nature and accomplishments of the National Bureau of 
Economic Research are indicated by the following quotation from 
the twentieth annual report of the director of research:¹


The National Bureau was established by men who believed that it is 
becoming possible to apply quantitative methods to the study of eco- 
nomic behavior. They realized that this field is far more difficult than 
the fields in which science has won its major triumphs and demonstrated 
its practical usefulness most conclusively. Also they recognized that
investigators cannot experiment at will upon society; though society
can and does experiment loosely upon itself. . . . Economics was not
likely to grow faster at this turning point in its career than its elder
sisters [the natural sciences]. But at the close of the First World War 
the materials for observing actual behavior were multiplying so rapidly 
and analytic methods of extracting significant conclusions were becoming 
so versatile and powerful that our founders thought their staff had good 
prospects of rendering valuable service at once. Also they hoped that 
one modest success would lead to others, fostering cumulative growth 
of the kind that has characterized systematic research in other fields. . . .

Twenty years of effort along the lines laid down in 1920 have con- 
firmed our faith in the social value of what the National Bureau set out 
to do. Our accomplishments have not been spectacular, but they have
been substantial, and they afford a secure foundation on which to build 
in future. We have more reason than ever to believe that in trying to 
establish a few economic fundamentals firmly we are aiding thoughtful 
men of all persuasions to plan wisely. If tested knowledge is the safest 
and surest guide in practical affairs, our work has social meaning, how-


¹ MITCHELL, WESLEY C., “The National Bureau's Social Function,”
March, 1940, pp. 13-15, 19.

ever technical its character. . . . We hold that advance will be rapid
and continuous in proportion as the workings of our economic system 
are understood. In trying to replace speculative opinions about eco- 
nomic relations by conclusions resting upon evidence we are expediting 
progress in the most effective manner we know. 

. . . Another device, peculiar to the National Bureau, is to select
directors who have divergent views on public policy and give each an
opportunity to criticize every manuscript. That device has been of
inestimable help to us in keeping our reports nonpartisan and therefore
worthy of credence by the public. Having such a board we cannot
expect unanimous consent from its members to many policies that
individuals among us favor. But the mere fact that the National
Bureau never takes sides upon controversial issues adds its bit of protection
against bias in our publications and helps toward meriting and
winning public confidence.


The more thoughtful sections of the public we are now reaching in
various ways. Physical scientists are coming to recognize the contributions
of research in economics; for example, in I Believe Robert A.
Millikan says:

“In economics and the social sciences long and elaborate statistical
studies must be made in order to eliminate the disturbing factors and
thus obtain the controlled conditions. We are just beginning to have
available, through the National Bureau of Economic Research and other
similar agencies, a large amount of such definite, dependable, statistical
knowledge in economics.”


The Twentieth Century Fund is another research association 
organized to function in a manner similar to that of the National
Bureau of Economic Research. It publishes occasional pamphlets
or books.


THE COMMERCIAL SOURCES 


In addition to the numerous sources of statistics resulting from 
individual or group research such as those described above, a 
great quantity of statistical sources has come into existence as 
the result of the activities of those who go into the business for 
the profit of collecting and selling statistical data. Such are the 
trade journals and the commercial and financial periodicals. 

Trade Journals. A large number of trade journals are actively 
engaged in collecting statistical data for various types of enterprise.
The Iron Age, for example, founded in 1855, is the trade
journal for the iron and steel industry, publishing statistics on
iron and steel production in all states and the prices of iron, steel,
copper, zinc, etc. Another example is Wileman's Brazilian
Review, which is the trade journal for coffee. The trade journals 
are frequently used by governmental statistical organizations, 
such as the Bureau of Labor Statistics, the Department of 
Commerce, and the Board of Governors of the Federal Reserve 
System, as the primary sources of particular data assembled 
by them. Occasionally the trade journals will publish in special 
pamphlet form or in books assembled data of the trade. 

During the 1920's, a large expansion in the collection and 
publication of statistics in various lines of economic activity on 
physical commodity production and distribution took place.
In a few instances this work was done by private companies.
Thus Seidman and Seidman compiled data on furniture for the 
Grand Rapids district, and R. L. Polk and Company compiled 
data on new cars registered; the function of the latter was subse- 
quently taken over by Ward’s Automotive Reports. Many 
such series were compiled by the trade journals from public 
records. The Iron Age compiled data on physical quantities 
of production of pig iron, and the Statistical Sugar Trade Journal 
published quantitative sugar statistics. 

Trade Associations. Most of the production and distribution 
series are compiled by the various trade associations, such 
as the American Face Brick Association (merged with the 
Structural Clay Products Institute), the American Paper and 
Pulp Association, and the United States Cane Sugar Refiners’ 
Association. 

The production and distribution statistical series are of various 
types. Some measure the flow of commodities through the 
process of production and distribution, for example, data on 
raw material received or consumed, like the figures on cotton 
consumption by textile mills or on cattle receipts at stockyards. 
Others give a measurement of quantity or stock of a commodity 
on hand. Still others are figures on the amount of orders or 
sales of the product, such as the unfilled orders of the United 
States Steel Corporation. As noted elsewhere, many of these 
series are collected from their original sources and published 
by the United States Department of Commerce in the Survey 
of Current Business. Consequently, the appendix of the Survey
contains a description of about every important commercial
source of statistics. In fact, the Department of Commerce
publishes a description of such statistical sources.¹ Frequently
a trade association will publish a sort of handbook or abstract
of statistics for the trade, covering historical as well as current
statistics.²

Commercial and Financial Publications. The commercial 
and financial journals and services are also too numerous to 
mention in detail, but a few may be described as typical. Among
these are the Commercial and Financial Chronicle (weekly), the
New York Journal of Commerce (daily), the Wall Street Journal³
(daily), Bradstreet’s (merged in 1933 with Dun's), Babson's
Reports, Moody's Investors' Service, Standard & Poor's Corporation,
Brookmire Economic Service, and the Dodge Statistical
Service. 

While there is much overlapping of published commercial 
and financial statistics through these various publications and
services, nevertheless each has become noted for especially good
statistical service in a particular line. For example, the user of
business-failure statistics thinks first of Bradstreet’s, because for
many years the data that it has published on business failures
have been widely used. Bradstreet's was also famous for its
index of wholesale prices for the United States, being a pioneer 
in the development and publication of such an index. Babson’s 
and Brookmire's services are noted for business forecasting and 
for investment services and forecasting the stock market. The 
New York Journal of Commerce is noted for its current data on 
new securities issued and on the produce markets. The New 
York Times is noted for its index of business activity, which 
was published in the Annalist (weekly) until that periodical 
was discontinued. The Commercial and Financial Chronicle is 
particularly useful for its detailed array of current data on bank 
clearings, business failures, interest rates, stock and bond prices, 
corporations, capital stock and bond issues, and the money 
markets of the world. This remarkable publication can be 


¹ “Sources of Current Trade Statistics,” Market Research Series 13 (1937).

² United States Cane Sugar Refiners’ Association, Sugar Economics, Statistics,
and Documents (1938).

³ Often referred to in footnotes as Dow, Jones and Company, which sometimes
mystifies beginners.

traced in its lineage back to 1820, when it started as Niles’
Weekly Register, famous as an early preacher of the doctrines of 
high tariffs and the “American system.” From 1839 to 1865 it
was called Hunt’s Merchant's Magazine. Since 1865 it has 
gone under its present name. The financial statements of all 
kinds of corporations, together with other statistics and corporate 
histories, are to be found in Moody’s Manual of Corporations. 

The Commodity Yearbook is published by the Commodity 
Research Bureau, New York, N.Y. This is a private organization
devoted to the dissemination of accurate information on
commodities and other related subjects, including production, 
consumption, prices, stocks, imports, exports, etc. Some are 
annual, some are monthly data. 

All the above-described sources are extensively used by 
American and foreign business enterprisers, whose subscriptions
to them and advertising in them make possible the vast statistical
undertakings on a profitable basis. The fact that they are so
supported would seem to prove the value of statistics to modern
business enterprise.

Federal Statistical Agencies. Department of Commerce. The
Department of Commerce is one of the greatest fact-gathering 
organizations of the Federal government, if not the greatest. 
It contains a number of bureaus chiefly engaged in the dissemination
of facts concerning not only commerce but economic and
social life in general. The Bureau of the Census is the fact- 
gathering agency of the Department. 

The Articles of Confederation provided for the taking of a 
triennial census, but the Constitution of the United States 
provides for the taking of a population census every 10 years, 
to serve as the basis for Congressional apportionment. The 
first one was taken in 1790. The broad practical and scientific 
purposes that the census today serves were not in the minds of 
the American founders, and the earlier census publications were 
meager affairs compared with the modern census.¹ The census
of 1790, for example, returned the number of free white males 
over, and the number under, sixteen years of age, the number of 

¹ CUMMINGS, JOHN, “Statistical Work of the Federal Government of the
United States,” in Koren, History of Statistics, pp. 670-672.

free white females without distinction by age, all other free
persons, and slaves—without, in the case of the last two classes, 
distinction by either sex or age. The published census of 1790 
consisted of a volume of 52 pages. At the census of 1800 and of 
1810, five age classes were distinguished and the age classification 
was extended to white females. In addition, at the census of 
1810, some facts were compiled relating to manufacturing estab- 
lishments, their number, nature, extent, situation, and value. 
A digest of the results of these data was prepared by Tench 
Coxe and published in 233 pages. The census of 1820 introduced 
the idea of collecting occupational statistics, calling for enumera- 
tions of persons engaged in agriculture, commerce, and manu- 
factures. The census of 1830 returned to the original idea of 
obtaining merely a population enumeration; but in 1838 Presi- 
dent Van Buren suggested to Congress in his annual message 
that the census should be extended so as to include “authentic 
statistical returns of the great interests specially entrusted to or 
necessarily affected by the legislation of Congress.” As a
result, Congress provided in the act for the Sixth Census (1840) 
that the marshals should “return in statistical tables . . . all 
such information in relation to mines, agriculture, commerce, 
manufactures, and schools, as will exhibit a full view of the 
pursuits, industry, education, and resources of the country.”
Congress overreached the capacity of those entrusted with the 
task of census taking, for the census of 1840 is famous for its 
inaccuracies. At the census of 1850, improvements in the 
organization of collecting and compiling the statistics were 
made; and, according to Cummings, with the census of 1850 
the decennial enumeration began to assume modern proportions 
and character. 

One of the outstanding American economists of the nineteenth 
century, Francis A. Walker, was a pioneer in developing the 
census to what we understand it to be now. He did particularly 
notable work in perfecting the organization and presentation 
of statistical data in the Tenth Census (1880), of which he had 
charge. 

At the Eleventh Census (1890), machine tabulation was 
introduced (the Hollerith tabulating machines), at a great 


¹ Ibid., p. 672.
² Ibid., pp. 672-015.

saving of time and expense. The printed reports of the census
of 1890 aggregated 21,410 pages, in 25 quarto volumes, the final 
report being issued in 1897. The Bureau of the Census was 
established as a permanent one in 1902 and since that time has 
been in continuous operation as a great fact-gathering organiza- 
tion for the national government. The tendency since that 
time has been to confine the decennial census to the major
subjects of population, manufacturing, agriculture, mines, and 
quarries, and in intervening years to take censuses of business. 
In intercensus years the Bureau also has charge of the annual
collection of mortality data, statistics on religious bodies, the
collection and compilation of statistics of cotton and tobacco,
and the annual compilation of statistics of cities of 30,000 population
and over, and financial statistics of states.¹

After 1902 the census of manufactures was taken every
5 years until 1919 and thereafter every 2 years until 1939. The
census of agriculture has been taken every 5 years since 1910.
The Statistical Atlas (containing graphic illustrations of much of
the census data) was first issued in 1874 [based on the Ninth
Census (1870)] and has appeared irregularly since that date.

In 1929 a census of distribution as well as of manufactures 
was taken; but when the National Recovery Administration 
began operations, many of the data assembled in the census 
year 1930 were out of date owing to the sharp business recession 
and the increase of unemployment following that year. Along 
with the regular biennial census of manufactures for 1933 the 
Bureau of the Census undertook an extensive census of business 
of types other than manufacturing, such as amusements, service 
businesses, barbershops, beauty parlors, repair shops, and tourist 
camps, covering more than 2,400,000 individual establishments. 

¹ By order of the Secretary of Commerce, the collecting of financial statistics
of states was discontinued temporarily after the 1931 report. With
no comparative basis provided by the statistics for smaller cities and no
individual reports for states, the remaining reports were of greatly reduced
value. A detailed analysis was therefore made of the needs for data in this
field and of the Bureau’s past and present inquiries. Closely related reports
were prepared for the director by the Central Statistical Board, the Advisory
Committee to the Director of the Census, and the Municipal Finance
Officers’ Association of the United States and Canada. Accordingly, the
Division of Financial Statistics of States and Cities was reorganized in 1936.
Annual Report of the United States Bureau of the Census, 1937, pp. 23-24;
1938, pp. 28-29.

For subsequent biennial dates the census of business was
further developed. The census of business covering the calendar 
year 1935, for example, was much broader in scope than either 
the census of distribution of 1929 or the census of American 
business for 1933. The 1935 census of business attempted to 
obtain a reasonably complete picture of essential and com- 
parable items of business information concerning practically all 
lines of business activity in the United States. It comprised 
a complete census of retail and wholesale trade, service businesses, 
amusement enterprises, hotels, broadcasting stations, advertising 
agencies, banking, insurance, real estate, bus transportation, 
trucking, warehousing, construction, and distribution of manu- 
facturer’s sales through primary channels. 

Elaborate care was exercised in preparing the 17 schedules; 
before final use they were submitted for criticism to representa- 
tives of the business groups and governmental agencies prin- 
cipally concerned. Special efforts were made by the Bureau to 
integrate the census of business and the biennial census of 
manufactures by the adoption of common definitions, instruc- 
tions, area designations, and field procedures. In order to 
perfect procedure, conferences were held to discuss schedules, 
procedures, and other problems inherent in such an expanded 
business census. These conferences were attended by representa- 
tives of trade associations, professional groups, chain-store 
organizations, etc., and by official representatives of a number 
of governmental agencies—the Central Statistical Board, 
Interstate Commerce Commission, Bureau of Foreign and 
Domestic Commerce, Tariff Commission, Federal Reserve 
Board, and Bureau of Labor Statistics.¹

The population schedule for the census of 1940 is notable 
for a number of new questions concerning employment status, 
migration, income status, housing, and education. It is also 
notable for the innovation of the sampling technique applied 
to one group of questions in order to widen the scope of the 
inquiries. It dropped the question on literacy. 

Employment and unemployment queries have been made in 
previous censuses, but the 1940 census made a new approach. 
The new data permit classification of the nation’s labor force 


¹ Annual Report of the United States Bureau of the Census, 1936, pp.
19-21. 




into the employed, the unemployed who have had previous work 
experience, and the unemployed without previous work experi- 
ence—new workers. They provide some measure of the volume 
of employment both during the whole year and during the week 
prior to the census day, Apr. 1, 1940. 

The schedule included questions that distinguish people at 
work, people unemployed who are seeking work, and people who 
have a job but are not at work because of temporary illness, 
industrial disputes, or vacations. Persons at work were asked to 
indicate the number of hours they worked during the week 
preceding the census, and the unemployed were asked to state 
the number of weeks they had been seeking work. Workers 
were classified as to whether they were in private industries or 
were employed by the government and whether they were own- 
account workers or unpaid family workers. 
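
The classification just described can be sketched in present-day terms. The Python fragment below is only an illustration under stated assumptions: the record fields (`at_work`, `has_job`, `seeking_work`, `prior_experience`), the category labels, and the five sample persons are hypothetical and are not taken from the 1940 schedule itself.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Person:
    # Hypothetical fields, loosely modeled on the questions described in the text.
    at_work: bool            # worked during the week preceding the census
    has_job: bool            # has a job, though possibly absent from it
    seeking_work: bool       # without a job and looking for one
    prior_experience: bool   # has held a job at some earlier time


def classify(p: Person) -> str:
    """Assign one labor-force category to a person (illustrative rules only)."""
    if p.at_work or p.has_job:
        return "employed"
    if p.seeking_work:
        return "unemployed, experienced" if p.prior_experience else "new worker"
    return "not in labor force"


# Invented sample records, one per line.
people = [
    Person(True, True, False, True),
    Person(False, True, False, True),    # e.g., temporarily ill or on vacation
    Person(False, False, True, True),
    Person(False, False, True, False),
    Person(False, False, False, False),
]

# Count how many persons fall into each category.
tally = Counter(classify(p) for p in people)
```

A person with a job who is temporarily absent is counted here with the employed; that is a simplifying assumption of the sketch, not a statement of the Bureau's rule.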

The new inquiry on wages and salaries is important as a
measure of national purchasing power and its distribution, and 
the resulting data have been helpful to business in indicating 
potential market areas. 

The net effects of internal population migration during the 
preceding 5 years were obtained by requesting the place of 
residence for each person as of Apr. 1, 1935. It is expected that 
compilation of the statistics comparing such residence with that 
of Apr. 1, 1940, which is also recorded on the schedule, will 
measure the effects of industry shifts, droughts, depressions, 
floods, the backflow west to east, and the shift from the city to 
the country, or vice versa. 
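
The comparison of the two residences can be illustrated with a short Python sketch. It assumes that each return reduces to a pair (residence on Apr. 1, 1935, residence on Apr. 1, 1940); the place names and records are invented for illustration, and the function name `net_migration` is hypothetical.

```python
from collections import Counter

# Hypothetical records: (residence on Apr. 1, 1935, residence on Apr. 1, 1940).
records = [
    ("Oklahoma", "California"),
    ("Oklahoma", "Oklahoma"),
    ("Kansas", "California"),
    ("California", "Oregon"),
    ("New York", "New York"),
]


def net_migration(pairs):
    """Return in-migrants minus out-migrants for each area that had any movement."""
    net = Counter()
    for origin, destination in pairs:
        if origin != destination:      # persons who did not move are ignored
            net[origin] -= 1
            net[destination] += 1
    return dict(net)


net = net_migration(records)
```

Because only the two end dates are compared, intermediate moves within the 5 years drop out; the result measures net effects, as the text states, and the figures for all areas sum to zero.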

In 1940, for the first time, the decennial census included a 
separate housing schedule designed to give detailed information 
for each dwelling unit in the United States, whether occupied or 
vacant, rural or urban. Data were obtained as to the number of 
rooms, water supply, bath and toilet facilities, and light equipment.
For each occupied unit or household, information was
obtained concerning the principal means of refrigeration used, 
the presence or absence of a radio, the character of the heating 
equipment, and the principal heating and cooking fuels used. 
Each residential structure was described in respect to single, 
double, or multiple family occupancy, whether or not it contained 
a business unit, for what purpose and in what year it was originally
built, the principal exterior material of the structure, and
whether it was in need of major repairs. The schedule included
a question on whether the family leases or owns, whether there 
is mortgage indebtedness, and methods of home finance. 

It is expected that the compilation of these data will provide 
valuable information on the latent purchasing power of a com- 
munity. There is no more important index of the social and 
economic status of a population than the standard of its housing. 
Housing experts believe that the information gathered will be of 
inestimable value in determining future housing policies. It 
will be of especial interest to manufacturers, builders, distributors, 
and bankers in their study of trends in home ownership and 
building in the United States. Cities will be able to determine 
the distribution of the various types of housing within their 
limits, together with the possible need of expansion of transporta- 
tion and communication systems, police and fire protection, 
schools, and similar facilities. Data showing the equipment in 
houses, together with the state of repair of the homes, will be of 
value to manufacturers and distributors of housing products 
in the planning of their sales campaigns. 

The agricultural schedules for the census of 1940 likewise had 
a number of new features. Nine regional schedules, each used 
in a separate group of states, were especially designed to fit 
national variations in cropping practices. Questions designed 
to obtain subtotals for the value of various major categories of 
farm products sold or traded in 1939 made possible a much 
closer estimate of total farm income and of farm income by 
principal sources. The 1940 census also introduced a supple- 
mentary plantation schedule for use in the cotton belt that made 
possible a refined distinction between farms and plots cultivated 
by croppers and defined the exact status of each cropper and 
certain other tenants in relation to the plantation owner. Ques- 
tions to measure the effects of current agricultural policies were 
also asked, relating to soil-improvement crops, summer fallow, 
crop failure, and succession or interplanted double cropping. 

The Bureau of Foreign and Domestic Commerce is the great 
Federal fact analyzer and fact publisher in the Department of 
Commerce. It has a curious and rather complicated history. 
From the beginning of the national period, the statistics of 


1 Cf. The New York Times, Jan. 24, 1940. 
foreign commerce were linked up with our tariff policy and main- 
tained by the Treasury Department. In 1856, growing out of 
an investigation of the tariff policies of other countries by the 
State Department, there was created a Bureau of Foreign Com- 
merce as a permanent bureau for the purpose of collecting 
statistics on foreign trade. In 1866, the Bureau of Statistics 
of the Treasury Department was created to take special charge 
of this work, and at the same time Congress gave it power to 
collect statistics on domestic trade as well as on foreign trade. 
In 1905, a Bureau of Manufactures in the Department of Com- 
merce was organized to foster, promote, and develop the various 
manufacturing industries of the United States, and markets for 
the same at home and abroad, by gathering and publishing all 
available and useful information concerning industries and 
markets. 

As a consequence, there were bureaus in three separate depart- 
ments (Treasury, State, and Commerce) concerned with the 
gathering of foreign-trade statistics. In 1912, however, these 
functions were centralized in the Bureau of Foreign and Domestic 
Commerce of the Department of Commerce. 

The most important statistical publications of this bureau 
are the monthly Survey of Current Business (with a weekly sup- 
plement) and the annual Statistical Abstract of the United States. 
Special publications designed to aid business are also prepared, 
for example, historical studies of industries, studies of the national 
income produced, and studies of market data.¹ 

Other bureaus of the Department of Commerce are the 
Bureaus of Fisheries, of Patents, and of Navigation and Steam- 
boat Inspection, each of which publishes specialized statistics. 
The two great statistical organizations in the Department of 
Commerce, however, are the Bureau of the Census and the 
Bureau of Foreign and Domestic Commerce. 

Department of Labor. The United States Department of Labor 
also contains bureaus that publish statistics, the most important 




1 Illustrations are P. W. Barker, Rubber Industry of the United States, 
1839-1939 (1939); Division of Economic Research, National Income in the 
United States, 1929-35 (1936); B. P. Haynes and G. R. Smith, Consumer 
Market Data Handbook (1939). For other statistical publications of the 
Bureau of Foreign and Domestic Commerce, see the United States Government 
Manual. 
from the point of view of quantity of data compiled and pub- 
lished being the Bureau of Labor Statistics. This was created 
in 1884 as the Bureau of Labor, although the Treasury Bureau 
of Statistics created in 1866 had been enjoined to collect wage 
statistics. In 1888 the Bureau of Labor was made an inde- 
pendent Department of Labor. The duties of the Department 
of Labor were to acquire and diffuse among the people of the 
United States useful information on subjects connected with 
labor, in the most general and comprehensive sense, and espe- 
cially on its relation to capital, the hours of labor, the earnings 
of laboring men and women, and the means of promoting their 
material, social, intellectual, and moral prosperity. The com- 
missioner of labor in charge of the Department was specially 
charged to investigate the causes of and facts relating to all 
controversies and disputes between employers and employees, 
and he was also empowered to make special studies of articles 
controlled by trusts and their effect on production and prices 
and other special subjects. Owing to the excellent work of the 
Department under the wise guidance of Carroll D. Wright, the 
first commissioner of labor, there is available a large mass of 
statistics in the field of labor for this country, including studies 
of strikes, the effect of the introduction of machinery on employ- 
ment and wages, the conditions of living and work of the laboring 
population, etc. Upon the basis of the wage and price data 
collected, index figures showing the trends of wages and prices, 
wholesale and retail, have been constructed and published by 
this bureau. 

In 1903, the old Department of Labor was transferred to the 
newly created Department of Commerce and Labor; but in 1913 
there was created a new Department of Labor, and in that 
department the Bureau of Labor Statistics. At the present 
time, the principal publications of the Bureau of Labor Statistics 
are the Monthly Labor Review (published since 1915), bulletins 
on special topics such as wholesale prices, retail prices, cost of 
living, wages, and labor turnover, and monthly serials to supple- 
ment the bulletins and give current information on those topics. 
Beginning in August, 1939, the Bureau of Labor Statistics pub- 
lished a daily index of 28 basic commodity prices at wholesale; 
but following the inauguration of wartime price controls this 
index was published only once a week since control in the raw- 
material field was widely effective. During wartime the index 
was of little importance.¹ 

Treasury Department. For the period before the Civil War 
the chief source of financial and price statistics in the United 
States, as well as data on governmental finance, consists in the 
finance reports of the Secretary of the Treasury. 

Before the development of statistical bureaus in the Depart- 
ment of Commerce and the Department of Labor, the Treasury 
Department was the most important source of Federal statistics; 
and it is still important in the fields of banking and monetary 
statistics, owing to the work of the comptroller of the currency, 
and in the field of income and Federal taxation and indebtedness, 
owing to the work of the commissioner of internal revenue and 
the Secretary of the Treasury. 

From the United States Treasury Department comes the 
monthly Statement of the Public Debt of the United States. 
The commissioner of internal revenue of the Treasury publishes 
an annual report of income-tax returns, constituting the most 
important source of data regarding income statistics in the 
United States. The annual reports of the comptroller of the 
currency give financial and banking statistics and monetary 
data going back as far as the Civil War, when the national bank- 
ing system began. The comptroller publishes these data in 
an annual report and also several times a year in the Abstract of 
Condition of the National Banks. The annual reports of the 
director of the mint contain statistics on the production of the 
precious metals, including gold and silver. The Life Saving 
Service of the United States Treasury Department publishes 
data on marine accidents. 

Interior Department. The Department of the Interior has 
important statistical aspects, too. The Bureau of Mines pub- 
lishes data on fatalities in coal mines. The Geological Survey 
publishes data on metal statistics and minerals. In the census 
years it has authority to collect statistics from primary sources. 
Since 1880 it has collected statistics carefully as to the crude 


1 For other statistics published by the Department of Labor see the United 
States Government Manual and see also Bureau of Labor Statistics, Selected 
List of Publications of the Bureau of Labor Statistics (1939), which can be 
purchased from the Government Printing Office. 

* See page 82 on the Federal Reserve System. 
oil lifted from the ground, iron ore, etc., watching the physical 
consumption of our natural wealth. It also collects and pub- 
lishes statistics on electrical power production which are now 
considered useful in the study of the general trend of business, 
so important to business is the use of electricity. Other bureaus 
in the Department of the Interior are the Bureau of Education, 
Bureau of Pensions, and the Bureau of Indian Affairs, each 
publishing certain specialized statistics indicated by their titles. 

Department of Agriculture. The Department of Agriculture 
was not founded until 1862, but statistical work relating to agri- 
culture of a more or less systematic nature dates back to 1839, 
when Congress appropriated $1,000 out of the patent fund, to be 
expended under direction of the commissioner of patents, “in 
the collection of agricultural statistics, and for other agricultural 
purposes.” At the present time the great bulk of Federal 
statistics on agricultural matters is collected and published by 
the Bureau of Agricultural Economics, which originally was the 
Bureau of Statistics in the Department of Agriculture and later 
was known as the Bureau of Markets and Crop Estimates. In 
addition to a host of bulletins on special subjects related to 
agriculture, this bureau publishes a monthly report on weather 
conditions, Crops and Markets, and gives out estimates of annual 
crop yields. In recent years it has become the source of pioneer 
statistical work in the measurement of the factors influencing the 
demand for agricultural products and other similar statistical 
studies in connection with the conduct of the Agricultural Adjust- 
ment Administration. The agricultural yearbook, published by 
this Department, is a valuable record of agricultural progress in 
the United States and contains also extensive summaries of 
agricultural statistics. Since 1936 these summaries have been 
published separately under the title Agricultural Statistics. 
Current agricultural data are disseminated by the Department 
of Agriculture in its monthly publication, the Agricultural 
Situation. The Bureau of Agricultural Economics, which has 
direct charge of the above publications, also furnishes part of 
the program for the Farm and Home Hour on the radio, designed 
to distribute timely agricultural information to the farming 
population of the nation. 

The administrative departments of the government thus con- 
stitute sources of statistics on a large scale, and statisticians 
continually make use of these Federal sources of statistics. 
These publications of the government are available to everyone at 
very low cost and can be found for free use in most large libraries 
of the country or at offices maintained for the purpose by the 
government. 

The Independent Establishments. In addition to the adminis- 
trative departments of the national government there are many 
national commissions or boards or agencies, collectively described 
as the “independent establishments” of the government. Some 
of these have become well-known sources of statistical data in 
special fields. The principal ones are the Interstate Commerce 
Commission, the Federal Trade Commission, the Federal Security 
Agency, the Federal Power Commission, the Federal Deposit 
Insurance Corporation, the Securities and Exchange Commission, 
the Tariff Commission, the Maritime Commission, and the Board 
of Governors of the Federal Reserve System. 

The Interstate Commerce Commission was created in 1887 
as the Federal government's solution of the railroad problem, 
following detailed Congressional reports of the situation, known 
as the Windom Report (1873-1874) and the Cullom Report 
(1886). These reports may be said to be the beginning of Federal 
railroad transportation and communication statistics. Since 
1887, such statistics have been gathered and published by the 
Interstate Commerce Commission, its powers having been gradu- 
ally extended to include other types of transportation, oil pipe 
lines, and express companies. In 1934 Congress created the 
Federal Communications Commission, which is devoted primarily 
to telephone, telegraph, cable, and radio. 

The Federal Trade Commission is the Federal source of data on 
the monopoly problem. In 1890 the Sherman Antitrust Act 
was passed; and in 1903 Congress realized that there was need to 
collect facts to be used as a basis for the enforcement of the 
Sherman Act. At the urgent request of President Roosevelt, 
Congress created the Bureau of Corporations for the purpose of 
gathering data that would aid in the proper enforcement. Fol- 
lowing the passage of the Federal Trade Commission Act of 1914, 
the Bureau of Corporations was merged with the Commission. 
This Commission publishes reports on its investigations of various 
trusts, such as the investigation of coal, cotton, cereals, meat 
packing, and a number of others. During the 1920’s and 1930’s 
it was a collector and publisher of statistics concerning trade 
associations and trade practices. 

The Board of Governors of the Federal Reserve System, which 
has operated since 1913, has become the greatest national source 
of statistics on banking and financial subjects. It publishes an 
annual report containing statistics on banking and related sub- 
jects, the Member Bank Call Report several times a year, and the 
Federal Reserve Bulletin, a monthly publication invaluable to 
bankers and statisticians working in banking subjects. In addi- 
tion, it publishes weekly mimeographed press releases on the 
condition of Federal reserve banks and of reporting member banks 
in order to make available more current data than is possible 
with the monthly or annual publications. In addition to financial 
and banking statistics the Board also has constructed through 
its Division of Research and Statistics an index of production 
calculated upon a comprehensive basis; this index and other 
special studies are also published in the annual reports and in the 
Federal Reserve Bulletin. 

The United States Tariff Commission, created in 1916, gathers 
statistics purporting to aid in the administration of the tariff 
laws and to help determine when duties should be raised or 
lowered. Owing to the strong influence of politics upon the 
question of the tariff, the studies of the Tariff Commission, with 
certain notable exceptions, constitute a great source of misuse 
of statistics. This was particularly true for the period from 
1920 to 1932 when most of its studies were for the purpose of prov- 
ing the need to raise tariffs. After the passage of the Reciprocal 
Trade Agreements Act in 1934 extensive improvements were 
inaugurated, and additional data were made available with the 
numerous studies that were conducted in cooperation with the 
State and other governmental departments. 

Finally, in connection with Federal statistics, it should be 
mentioned that frequently Congressional investigations result in 
the assembly and publication of valuable statistical material often 
constituting original sources or at least original compilations of 
such material. Mention has already been made of the Windom 
Report in 1873-1874 and the Cullom Report in 1886, both on 
transportation, which led to the creation of the Interstate Com- 
merce Commission in 1887. Other examples are the Pujo 
Money Trust Report of 1913 and the various reports of the Senate 
and House Committees on Banking and Currency during the 
1930's on brokers’ loans, branch banks, the operation of the 
national and Federal reserve banking systems, foreign loans, and 
stock-exchange practices. Important Federal legislation of that 
decade was based on these investigations. 

Several noteworthy special commissions, created by Congress 
from time to time, have produced published documents that have 
become famous as great sources of primary statistical information. 
The Aldrich Reports from the Senate Committee on Finance, 
on Retail Prices and Wages (1892) and Wholesale Prices, Wages, 
and Transportation (1893) constitute extensive compilations of 
price data covering a period of over fifty years. These reports 
have been extensively used as source material for statistical 
studies of prices and wages for the period 1850 to 1900. 

The Industrial Commission created by act of Congress of 
June 18, 1898, submitted a report to Congress in 1902, consisting 
of 19 volumes and presenting a substantially complete epitome of 
the industrial life of the nation and of the important changes in 
business methods that occurred in the latter part of the nineteenth 
century. These volumes are largely statistical in their methods 
of description. The Immigration Commission, created in 1907, 
presented to Congress in 42 volumes a full inquiry into the sub- 
ject of immigration, reviewing statistically immigration to the 
United States during the period 1820 to 1910 and the com- 
ponent elements in our population as determined by immigration 
from 1850 to 1900. The National Monetary Commission, 
created in 1908, studied the banking and currency systems of 
the United States as compared with those of other countries. 
This Commission collected more complete statistical information 
with regard to the banks of foreign countries such as Great 
Britain, France, and Germany than had ever been collected 
before and for the first time in this country obtained compa- 
rable statistics for all banks in the United States. The full 
report of the Commission, consisting of 24 volumes, was com- 
pleted in 1912 and served as the basis of the bank-reform 
legislation known as the Federal Reserve Act. 

Other similar statistical studies in various fields of economic 
and social life have been made by commissions, such as those of 
the Select Committee on Wages and Prices established in 1910, 
the Commission on Industrial Relations created by an act of 
1912, and the Commission on National Grants to Vocational 
Education. The Hoover Committees on Social Trends (1933) 
published extensive studies, partly statistical in character, of the 
economic and social life of the nation. 

One of the most notable of such temporary organizations was 
the National Resources Planning Board, established in the 
executive office of the President of the United States under 
authority of the Reorganization Act of 1939. This Board 
succeeded the National Resources Committee, which had been 
established in 1935. Earlier names of the same organization 
were National Resources Board and Advisory Committee and 
National Resources Board, which was created in 1934 to succeed 
the planning organization of the Federal Emergency Administra- 
tion of Public Works. When the United States Congress dis- 
covered what it felt was an attempt by the executive to usurp 
Congressional powers by having an economic planning board, 
it became hostile to the National Resources Planning Board. 
This hostility was not diminished when in 1943 the Board pre- 
sented to the executive a plan for the postwar expansion of the 
Federal security program. President Roosevelt handed the 
report over to Congress for action, but the Board was abolished 
in that year when Congress refused to vote funds for its con- 
tinued existence. During the course of its checkered career, 
however, the Board became the author of several noteworthy 
statistical publications: Energy Resources and National Policy 
(1939), The Problems of a Changing Population (1938), Consumer 
Incomes in the United States (1938), Consumer Expenditures 
in the United States (1939), and The Structure of the American 
Economy (1939). 

State and Municipal Sources. The activities of the various 
state governments result also in the compilation and publication 
of statistics. Most states maintain departments of institutions 
and agencies that, through supervision of reform schools, prisons, 
hospitals, and the like, become sources of statistics on mental 
and physical pathology, as well as delinquency. Data concern- 
ing the records of penal and charitable institutions, hospitals, 
and asylums for the insane and feeble-minded are primarily 
recorded by state or by municipal organizations. 

Vital statistics, that is, data relating to births and deaths and 
the classification of deaths by causes, have become an important 
part of the demographic work of municipalities and states and 
have thus made the state and municipal governments important 
primary sources of data of this character. In addition, statistics 
on marriage and divorce are recorded through state and munici- 
pal licensing administration. 

Data are recorded by states and regularly reported, based on 
their tax-collecting, licensing, and registration responsibilities. 
For example, statistical data result from automobile registration 
by states. 

State incorporation laws result in the accumulation of data. 
State incorporated banks and trust companies and building and 
loan associations, for example, are all regulated by the banking 
departments of the various states, and statistics regarding these 
institutions are regularly compiled and published by these 
departments. Similarly, life insurance, fire insurance, automo- 
bile and casualty insurance, and workmen's compensation laws 
and social-security laws have resulted in state regulating bodies 
and the compilation and publication of statistical data on 
financial, commercial, and industrial subjects. 

A number of the larger and older of the industrial states have 
highly efficient labor departments, which compile and publish 
statistics of industrial conditions. Of increasing importance and 
interest to social scientists is the development of the volume of 
statistics relating to industrial accidents and diseases, growing 
out of the need for such statistics in the administration of the 
workmen's compensation laws. 

The regulation of public utilities and water companies and 
street-railway and bus companies by state and municipal authori- 
ties has made the public-utility commissions of the states the 
principal primary sources of statistical data on these important 
industries, although in the 1930's many of these data were 
gathered by the Federal Power Commission and the Securities 
and Exchange Commission. 

WORLD STATISTICS 

Under the League of Nations progress has been made in the 
collection and publication of world statistics. These are pub- 
lished in the Monthly Bulletin of Statistics of the League of 
Nations and also in its International Statistical Yearbook and 
its annual World Economic Survey. Statistics on world com- 
mercial banking and finance were published in special League 
publications. Previous to the work of the League of Nations 
in this respect, the World Almanac had for many years been 
highly valued as a rough-and-ready source of a variety of world 
statistics and still constitutes a popular source. 

The Statesman’s Yearbook, published by Macmillan & Com- 
pany, Ltd., London, is a statistical and historical annual of the 
states of the world, giving data on population, area, finance, 
commerce, and banking, as well as figures on the fleets of the 
world and the world’s shipping. It has been issued annually 
since 1864. The United States government has always shown 
considerable interest in statistics of foreign countries and has 
published them along with the domestic data; but this practice 
has been far more systematic and thorough since the First World 
War. For example, the Federal Reserve Bulletin regularly 
publishes statistics of prices, banking, and currency conditions 
in the principal nations of the world; foreign price statistics are 
published by the Bureau of Labor Statistics in its special bulletins; 
and statistics on trade between other countries, that is, the trade 
of the world outside the United States and not with the United 
States, are published by the Department of Commerce in Vol. 2 of 
the Commerce Yearbook (as well as the statistics of our own foreign 
trade). In 1938 the Paris International Chamber of Commerce 
published a brochure on the economic statistics in 26 countries. 

In addition to such collections of statistics for all or a majority 
of the countries of the world, mention should be made of the 
sources, in greater detail than the world volumes, for statistics 
concerning three of the important countries of Europe. For 
England and the Dominions, there is the Statistical Abstract for 
the British Empire, published by the Board of Trade. This 
combines what was previously published in the Statistical 
Abstract for the United Kingdom (first issued in 1864 for the years 
1840-1853) and the Statistical Abstract for the Several British Over- 
sea Dominions and Protectorates (first issued in 1864 for the years 
1850-1863). The French government publishes Annuaire statis- 
tique (1878) and the Bulletin de la statistique générale (1911). In 
Germany the official source of statistics is the Statistisches Jahrbuch 
für das Deutsche Reich (1880). 

It has long been recognized that international statistics would 
be extremely important in obtaining true international, political, 
and economic understanding and cooperation. Consequently, 
for many decades, efforts have been made to arrive at some sort 
of international understanding on methods to make the com- 
pilation of international statistics feasible or at least to improve 
existing world statistics. The statistics of each country are 
gathered according to the needs of that country; and since the 
problems in respective countries differ, so do the statistics. 
Their compilation and classification, according to varying 
definitions of units and varying bases of classification, produce 
startling differences in the final results. Then, too, the economic 
organizations of the various countries are different. A country 
with a large amount of transit trade and heavy reexportation 
of goods imported needs a different sort of classification of foreign 
trade statistics than a country doing little reexport business. 
Furthermore, the statistics themselves are gathered and organized 
in diverse ways in the various countries; the methods of collecting 
the statistical raw materials, the periods for which these data are 
gathered, and the methods of classification are not the same in 
the various countries. 

The endeavors made in the last eighty years for better inter- 
national statistical information, therefore, were first concentrated 
on the problem of rendering national statistics more comparable, 
since national statistics must be comparable between the various 
nations before they can be added up or compared to obtain 
international or world statistics. Quételet, the Belgian who did 
so much to organize comparable international astronomical 
observations, was likewise the first to try to solve the problem 
of obtaining the fundamental basis for better world and inter- 
national statistics. It is principally due to him that the First 
International Statistical Congress was organized in 1853 in 
Brussels. The main purpose of this Congress, the members of 
which attended in their private and not in their official capacity 
(although some were officials), was to bring about some degree 
of comparability in national statistics between the various 
nations. 

Another attempt to obtain international cooperation in 
statistical work was made in 1887 when the International Statis- 
tical Institute was formed. This organization, still in existence, 
elects members who are active in statistical work as professors, 
government officials, or members of private statistical offices. 
The Institute cannot bind its members or the national govern- 
ments of its members but makes progress by suggesting improve- 
ments to different countries. 

The first official or semiofficial attempts for better world 
statistics were made in 1875 through the establishment of the 
International Bureau of the Universal Postal Union and the 
Bureau of the International Telecommunication Union (origi- 
nally called the International Bureau of the Telegraph Union). 
Both regularly gather statistics on postal and telegraphic develop- 
ments. Similar efforts in another field were made for the first 
time in 1882, by the International Congress for Hygiene and 
Demography. In 1905, another significant official attempt was 
made for greater comparability in world statistics. In that year, 
at the suggestion of the United States government, a meeting 
was held in Rome to formulate some plan for obtaining uniform- 
ity of agricultural statistics. This meeting led to the founding 
of the International Agricultural Institute, which still is active 
in the gathering of world statistics on agriculture, production, 
consumption, prices, and trade. The statistical information 
assembled by this body is published monthly and yearly and 
special publications are also issued. Sixty-two different coun- 
tries are members of the Institute. The Institute was very 
successful in putting national agricultural statistics on an 
internationally more comparable basis and in assembling regu- 
larly good and reliable world statistics on all fields of agriculture. 

Since the First World War, the League of Nations has been 
the natural organization to proceed with the work of interna- 
tionalizing statistics. Shortly after its establishment, the League 
started that work. At the International Economic Conference 
of 1927 the problem of comparable national statistics in order to 
secure good world statistics was studied. The League of Nations 
subsequently brought about an official meeting on the subject 
of international statistics and called an International Statistical 
Conference to meet in Geneva in November, 1928. The keynote 
of the Conference was that the general adoption of comparable 
international statistics was desirable for good international 
policies and in the interests of permanent world peace. The aim 
of the Conference was to bring about the broadening of the scope 
of national statistics in all countries where it seemed to be needed 
and to attempt to make national statistics in different countries
comparable. The Conference emphasized once more that such
attempts meet with many difficulties. Of the 42 countries represented
(some nonmembers of the League, like the United States,
were also represented), only 29 countries felt they could sign the 
Convention and Protocol of the Conference. To induce that 
number to sign, it was necessary to limit greatly the program of 
work.

Nevertheless, the Conference of 1928 did produce good results. 
A number of points were discussed, and important conclusions 
were reached. In addition, the Conference created a committee
of technical experts to meet from time to time and make sugges- 
tions for further progress. This group met in March, 1931, and 
formulated a constitution for future work. It met again in 
December, 1933, to discuss problems of statistics on foreign 
trade. Up to the present time, its contribution to the solution 
of the problems involved has been inconsiderable, but it may 
make advances in this important work if the countries concerned 
will be willing to carry out the recommendations made by it, as
they are apparently committed to do by the Convention and 
Protocol of the Conference of 1928. 

In 1936 the twenty-third session of the International Institute 
of Statistics was held at Athens. At that session there were
75 members, of which 10 were from North America. Twenty- 
seven countries designated official delegates. Also, the Secretary 
of the League of Nations, the International Labor Office, the 
International Institute of Intellectual Cooperation, the Inter- 
national Institute of Agriculture, and the International Chamber 
of Commerce were represented.1

In May, 1940, one of the 11 sections of the Eighth American
Scientific Congress convened by the government of the United
States in connection with the observance of the fiftieth anniversary
of the founding of the Pan American Union was devoted to
statistics. The program of the section had the following broad
objectives: (1) improvements in the comparability of official 


1 STUART, PROF. C. A. VERRIJN, “La XXIIIème session de l'institut
international de statistique, Athènes, 1936,” Revue de l'institut international
de statistique, vol. 4 (1936), pp. 307-403. The citation includes the summary
of resolutions of the session (pp. 378-395) and communications from various
delegations on methods, legislation, organization, and administration of
statistics (pp. 396-403).

statistics among the American nations; (2) improvements in
statistical methodology; (3) the furtherance of acquaintance 
among the statisticians of the American continent; (4) con- 
sideration by these statisticians of the possible development of a 
continuing professional medium for the interchange of statistical 
ideas and information. Correspondents in several of the 
American nations had pointed to the need for closer profes- 
sional collaboration among the statisticians of this hemisphere, 
and it was proposed to explore at this meeting the possibilities 
of establishing some kind of an inter-American statistical organi- 
zation of professional character. The result was the formation 
of the Inter-American Statistical Institute. 

A new quarterly, the Estadistica, published in Mexico, is the 
official organ of the Inter-American Statistical Institute, con- 
stituting one of its mediums for fostering statistical development 
in the Western Hemisphere. It endeavors to acquaint the 
persons in one country with statistical developments in other 
countries, to inform its readers concerning the availability of 
data, to present articles that will tend to encourage the adoption 
of improved methods, and hence to improve the quality of data. 
Articles may appear in any of the following four languages: 
Spanish, English, Portuguese, or French. An author’s sum- 
mary accompanies each article; the summary is reproduced in 
several languages. The Inter-American Statistical Institute 
also publishes a yearbook of statistics including statistical data 
for Latin-American countries and North America. 

Prospects for securing comparable world statistics and for international
statistics fluctuate with the rise and fall of isolationism
and nationalism. Under the League of Nations and under the 
Pan American Union progress has been encouraged, only to be
hampered by ever-persistent isolationism in one country or
another. Nevertheless, the need for comparable data with 
respect to all nations of the world has become more and more 
evident; it has come to be more and more appreciated as the
problems have been studied by these various institutes, con- 
ferences, and committees, and more and more is it coming to 
be realized that such statistics are a pressing necessity to businessmen
with interests spread far and wide over the international
field.

While it has been stressed in this section that there are as yet
no truly comparable international statistics, the student of 
international affairs and the international businessman will be 
able to obtain what constitutes for the present the closest 
approximation to them from a number of sources, chief among 
them the following: (1) International Statistical Yearbook (pub- 
lished by the League of Nations); (2) Vol. 2 of the Commerce 
Yearbook (published by the United States Department of 
Commerce); (3) The International Appendix to the Statistics 
Yearbook of Germany (Statistisches Jahrbuch für das Deutsche
Reich); (4) the Statesman’s Yearbook. The World Peace Foundation
also publishes a subject index to the economic and financial
documents of the League of Nations.

table by the text accompanying it and particularly by the title
of the table. Briefly, the story of the table should be told in 
literary form in the text, reliance being placed on the table 


TABLE 2.—AVERAGE DISBURSEMENTS OF CONSUMER UNITS1 IN EACH THIRD
OF NATION, 1935-1936

                                   Average disbursements of families          Percentage of income
                                       and single individuals in
                                        Lower      Middle       Upper       Lower      Middle       Upper
                                       third,      third,      third,       third       third       third
                                      incomes  incomes of  incomes of
                                        under     $780 to  $1,450 and
Category of disbursement                 $780      $1,450        over

Current consumption:
  Food                            [illegible] [illegible] [illegible]        50.2        37.5        21.7
  Housing                         [illegible] [illegible] [illegible]        24.4        18.5        13.8
  Household operation                      54         108         240        11.4        10.0         8.1
  Clothing                                 47         102         251        10.0         9.5         8.5
  Automobile                               16 [illegible]         215         3.3         5.3         7.2
  Medical care                             20          41         100         4.3         3.9         3.6
  Recreation                                9          28          89         1.8         2.6         3.0
  Furnishings                               9          28          72         1.8         2.6         2.4
  Personal care                            12          22          44         2.5         2.1         1.5
  Tobacco                         [illegible]          23          40         2.2         2.1         1.4
  Transportation other than auto           11          19          37         2.4         1.7         1.3
  Reading                                   6          12          28         1.3 [illegible]         0.8
  Education                                 2           7          20         0.5 [illegible]         1.0
  Other items                               3           6          15         0.6         0.5         0.5
    All consumption items                $550      $1,056      $2,212       116.7        98.1        74.8
Gifts and personal taxes2                  13          39         181         2.8         3.7         6.1
Savings                                   -92         -19         566       -19.5        -1.8        19.1
    All items                            $471      $1,076      $2,959       100.0       100.0       100.0

1 Includes all families and single individuals, but excludes residents in institutional groups.
2 Taxes shown here include only personal income taxes, poll taxes, and certain personal
property taxes.
Source: National Resources Committee, Consumer Expenditures in the United States,
Estimates for 1935-36 (1939), p. 40.

merely as a dramatic summary. Simple devices to aid inter- 
pretation and facilitate the mental vision of the table have a 
useful place in special-purpose tables, such as accompanying 
relative figures, methods of emphasis such as italics, or the 
scheme of ruling the table.

The object of a special-purpose table may also be to compress
into a small space a body of information “the narration of which
in the text would be cumbersome and exhausting to the reader.


TABLE 3.—SHARE OF EACH THIRD OF NATION'S CONSUMER UNITS1 IN
AGGREGATE DISBURSEMENTS, 1935-1936

                                      Aggregate disbursements of            Percentage of aggregate
                                        (in millions of dollars)
                                        Lower      Middle       Upper       Lower      Middle       Upper
                                       third,      third,      third,       third       third       third
                                      incomes  incomes of  incomes of
                                        under     $780 to  $1,450 and
Category of disbursement                 $780      $1,450        over

Current consumption:
  Food                                 $3,108      $5,310      $8,447        18.4        31.5        50.1
  Housing                               1,515       2,621       5,370        15.9        27.6        56.5
  Household operation                     703       1,422       3,160        13.3        26.9        59.8
  Clothing                                618       1,338       3,305        11.7        25.5        62.8
  Automobile                              203         755       2,823         5.4        20.0        74.6
  Medical care                            264         546       1,395        12.0        24.7        63.3
  Recreation                              115         362       1,166         7.0        22.0        71.0
  Furnishings                             112         368         942         7.9        25.9        66.2
  Personal care                           155         292         585        15.1        28.2        56.7
  Tobacco                                 134         301         531        13.8        31.2        55.0
  Transportation other than auto          150         247         487        17.0        27.9        55.1
  Reading                                  84         165         302        15.3        29.9        54.8
  Education                                30          87         389         5.9        17.2        76.9
  Other items                              35          76         196        11.4        24.6        64.0
    All consumption items              $7,226     $13,890     $29,098        14.4        27.7        57.9
Gifts and personal taxes2                $171        $516      $2,380         5.6        16.8        77.6
Savings                                -1,207        -259       7,437       -20.2        -4.2       124.4
    All items                          $6,190     $14,151     $38,915        10.4        23.9        65.7

1 Includes all families and single individuals, but excludes residents in institutional groups.
2 Taxes shown here include only personal income taxes, poll taxes, and certain personal
property taxes.
Source: National Resources Committee, Consumer Expenditures in the United States,
Estimates for 1935-36 (1939), p. 51.


It is, in short, a method of condensation, and it is of the utmost 
importance that, as it tells so much in so small a compass, it 
tell it as clearly as practicable.”1


1 FALKNER, ROLAND P., “Statistical Tabulation and Practice,” Journal
of the American Statistical Association, vol. 11 (1916), pp. 192-200.

Tables 2 to 6 are examples of special-purpose tables. They 
tell stories that are more or less hidden in the detailed but well- 


TABLE 4.—PERCENTAGE DISTRIBUTIONS OF NONRELIEF FAMILIES1 IN SIX
TYPES OF COMMUNITY, BY INCOME LEVEL, 1935-1936

                                                            Families living in
                                               Urban communities                      Rural communities
                           All Metropolises        Large Middle-sized        Small
                      families    1,500,000      cities,      cities,      cities,
                                 population   100,000 to    25,000 to     2,500 to
                                  and over2    1,500,000      100,000       25,000     Nonfarm3         Farm
Income level                                  population   population   population

Under $250                 2.8          1.7          2.0          2.4          3.1          3.0          3.8
$250-$500                  7.8          2.8          4.4          5.5          6.3          8.9         13.9
$500-$750                 11.3          5.2          7.6          9.4         10.8         11.8         18.0
$750-$1,000               13.4          8.5         10.5         13.6         13.9         14.4         16.6
$1,000-$1,250             13.2         10.9         12.4         13.9         14.6         14.0         12.8
$1,250-$1,500             10.8         11.0         10.6         11.6         11.1         11.6          9.8
$1,500-$1,750              9.1         10.8         10.0          9.7          9.4          9.1          7.0
$1,750-$2,000              7.8          9.7          9.0          8.5          7.8          6.5          4.8
$2,000-$2,250              5.5          7.9  [illegible]          6.1          5.8          5.1          3.1
$2,250-$2,500              4.0          5.8          5.5          4.5          4.0          3.4          2.5
$2,500-$3,000              5.2          8.5  [illegible]          5.4          5.3          4.4          2.9
$3,000-$3,500              3.0          4.7          4.2          3.1          3.1          2.3          1.6
$3,500-$4,000              1.8          2.9          2.7  [illegible]  [illegible]          1.3          1.0
$4,000-$4,500              1.0          1.7          1.6          1.0          0.8          0.8          0.5
$4,500-$5,000              0.6          0.9          0.9          0.7          0.5          0.6          0.3
$5,000-$7,500              1.3          2.1          1.8          1.3          1.1          1.4          0.6
$7,500-$10,000             0.8          1.6  [illegible]          0.6          0.6          0.6          0.4
$10,000 and over           1.1          3.3  [illegible]          1.0          0.6          0.8          0.4
    All levels           100.0        100.0        100.0        100.0        100.0        100.0        100.0

1 Excludes all families receiving any direct or work relief (however little) at any time
during year.
2 Metropolises of this size are in North Central Region only (New York, Chicago,
Philadelphia, and Detroit).
3 Includes families living in communities with population under 2,500, and families living
in the open country but not on farms.
Source: National Resources Committee, Consumer Incomes in the United States, Their
Distribution in 1935-36 (1938), pp. 24-25.


organized statistics collected by means of the questionnaire 
referred to above. In order to simplify the data for presentation,
income levels are divided into three groups, lower third, middle
third, and upper third. These tables also illustrate the use of
percentage figures to facilitate their interpretation.
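
The percentage columns of Table 3 can be recomputed from the dollar aggregates, which is a convenient check when figures are copied from one table to another. The short Python sketch below is an illustration only: the function name percentage_shares is hypothetical, and the figures used are the food and housing rows of Table 3.

```python
def percentage_shares(amounts):
    """Return each amount as a percentage of the total, rounded to one decimal."""
    total = sum(amounts)
    return [round(100 * a / total, 1) for a in amounts]


# Aggregate disbursements of the lower, middle, and upper thirds (Table 3).
food = [3108, 5310, 8447]
housing = [1515, 2621, 5370]

print(percentage_shares(food))     # [18.4, 31.5, 50.1]
print(percentage_shares(housing))  # [15.9, 27.6, 56.5]
```

The results agree with the percentages printed in the right-hand half of Table 3 for these two rows.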


CHARTS 


Quick visualization of many rather complex situations can 
be readily achieved by merely looking at a simple chart. It is 
said that nowadays the first step toward using a series of data 
for any sort of analysis is to represent the figures by a line drawn 
on a chart. So useful is the chart in giving a quick grasp of the

Fig. 11.—Federal expenditures for war activities. (From data published in Daily
Statement of the United States Treasury Department.)


characteristics of data that it has been adopted in many popular 
books, in magazines, and in the financial section of metropolitan 
newspapers. Figures 11 and 12 illustrate dramatically the 
manner in which charts are used to aid in visualizing important 
developments during wartime. In peacetime the trends of 
data, even though less sensational, are watched with care, and 
charts greatly facilitate their analysis. 

The invention in 1786 of charting is claimed by William Play- 
fair, who set forth its advantages as follows: “As the eye is the 

1 The Commercial and Political Atlas (3d ed., London, 1801), p. x. Playfair’s
claim to be “actually the first who applied the principles of geometry
to matters of Finance” is made on pages viii and ix. Cited from W. C.
Mitchell, Business Cycles—The Problem and Its Setting, p. 209. In An
Enquiry into the Decline and Fall of Nations Playfair is said to have been the
first to employ graphical devices in the treatment of sociological discussion.

best judge of proportion, being able to estimate it with more
quickness and accuracy than any other of our organs, it follows, 
that wherever relative quantities are in question, a gradual 
increase or decrease of any . . . value is to be stated, this mode 
of representing it is peculiarly applicable; it gives a simple, 
accurate, and permanent idea, by giving form and shape to a 
number of separate ideas, which are otherwise abstract and 
unconnected.”

Fig. 12.—Production of munitions, including ships, planes, tanks, guns, ammunition,
and all field equipment. (Data from War Production Board.)


While the idea underlying the use of charts is quite old, the 
general use of charts for wide public consumption is of much
more recent origin and probably owes its present-day popularity 
to inventions having to do with the plating of charts for printing. 
From being largely a hand-labor process, the making of plates 
for the reproduction of charts has come in recent years to be a 
photoelectric process, with the result that today the most 
expensive part of the charts in a book, newspaper, or magazine 
article consists in the mental and hand labor involved in the
original construction of the chart. 

There are five kinds of charts: (1) pictograms, (2) cartograms,
(3) frequency curves, (4) bivariate charts, and (5) curves pictur- 
ing time series. 


“William Playfair was, one may say, the Sir William Petty of the Edinburgh
group. . . .” Lancelot T. Hogben, Dangerous Thoughts (1939), p. 283.

Pictograms. There are four kinds of pictograms: (1) linear pictograms,
in which the comparison is a linear one; (2) areal
pictograms, in which the comparison is one of areas; (3) cubic
pictograms, in which the comparison is one of cubes, or three-dimensional
objects; and (4) sectors and circles, in which a circle
is used to represent a whole and its various sectors are parts 
of the whole. 


Fig. 13.—Distribution of the milk dollar. (Data from The Milk Dollar, Milk
Industry Foundation.) [Sector labels legible in the figure: officers' salaries;
taxes and licenses; rents, misc.; depreciation; bottles, cases, trucking.]


The purpose of pictograms is to aid in rapid visualizing of 
coordinate comparisons of magnitudes. For example, a picto- 
gram might represent by a picture of a man the population of the 
United States, accompanied by a picture of proportionately 
smaller men representing, respectively, the populations of 
France and Germany. Sometimes pictograms are used to aid
in visualizing the proportional parts of a whole magnitude, or 
comparison of component parts, as where a dollar is shown divided 
into sectors, representing the way in which the “public dollar”
is spent. Figure 13 is an illustration of a pictogram showing the
“milk dollar.” The small obscured piece at the top represents
2.98 cents of profit for the New York City distributors.

Areal and cubic comparisons are not frequently used because, 
instead of simplifying the comparison desired, they are likely 


Fig. 14.—Area comparison.


to confuse it. This is because the mind finds difficulty in 
quickly differentiating sizes of areas or of cubes. Figure 14 
shows two areas in the form of squares. One of these areas is 
actually one-half as large as the other; but, at first glance, 
it seems to be more than half as large. Consequently, if com- 


Fig. 15.—Cubic comparison.


parison of two quantities is desired by charting, areal presenta- 
tion is not a desirable method of obtaining easy comprehension 
of the differences that it is desired to stress. 

The difficulty is increased if the attempt is made to chart
differences of magnitude by the use of cubes, for it is still more
difficult for the eye and mind to grasp geometric comparative
magnitudes in three dimensions. This is shown in Fig. 15, 
which depicts two cubes, one of which is one-half as large as the 
other, though a first glance makes it appear to be two-thirds 
as large. For this reason the use of pictures for making com- 
parisons is not considered to be the best practice. For example, 
the presentation for quick visualization of different-sized men 
in uniform to represent the relative fighting strength of various 
countries or of different-sized battleships to represent the relative 
size of navies will confuse the interpretation that the eye and 
mind will give to the relative sizes compared, even though the 
relative size is given purely a linear setting in the actual drawing 
of the figures.1 Only the height of the uniformed men may be
varied, but this might lead to comically proportioned men and 
an illusion of armies of tall thin men vs. armies of short fat men. 
If the uniformed men are properly proportioned for their varying 
heights, this results in an areal comparison. 
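
One plausible reason that the eye overestimates the smaller figure in areal and cubic comparisons is that its sides shrink far less than its area or volume. The Python sketch below is an illustration under that assumption, not a method given in the text; the function name linear_scale is hypothetical. It computes how long each side of the smaller figure must be, relative to the larger, when the quantity represented is one-half as large.

```python
def linear_scale(ratio, dimensions):
    """Side length of the smaller figure, as a fraction of the larger figure's
    side, when its length, area, or volume is `ratio` times that of the larger."""
    return ratio ** (1.0 / dimensions)


for name, dims in [("bar (length)", 1), ("square (area)", 2), ("cube (volume)", 3)]:
    side = linear_scale(0.5, dims)
    print(f"{name}: side is {side:.3f} of the larger figure's side")
```

For a ratio of one-half the sides come out at 0.500 for a bar, about 0.707 for a square, and about 0.794 for a cube. Only the bar shows the true ratio in the one dimension the eye judges most readily.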

Consequently, the most generally used types of pictogram
are those involving merely linear comparisons and the use of 
purely abstract linear distances. Rows of soldiers, each soldier 
representing a specified number of men, may be used to advan- 
tage, however, the longer row representing the larger army. 
Similarly, large and small navies can properly be compared by 
rows of ships, each ship representing a specified tonnage of that
type of warship. Such pictograms are really linear comparisons,
as also are bar charts and sectors of circles.
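
The row-of-symbols pictogram described above can be sketched in a few lines of Python. This is an illustration only: the function name pictogram_rows, the country labels, and the army sizes are all hypothetical and are not taken from the text. Each symbol stands for a fixed number of units, so that the comparison is carried by the length of the row.

```python
def pictogram_rows(quantities, unit, symbol="#"):
    """Build one row of repeated symbols per item; each symbol stands for `unit`."""
    rows = {}
    for label, quantity in quantities.items():
        count = round(quantity / unit)  # number of whole symbols in the row
        rows[label] = symbol * count
    return rows


# Hypothetical army sizes in thousands of men (illustrative figures only).
armies = {"Country A": 800, "Country B": 500, "Country C": 200}

for label, row in pictogram_rows(armies, unit=100).items():
    print(f"{label:<10} {row}")
```

Because every symbol is the same size, the rows can be compared by length alone, and no areal judgment is required of the reader.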

Bar Charts and Sectors of Circles. The use of bar charts 
and sectors of circles is widely practiced and finds its application 
whenever it is desired to compare two or more differing mag- 
nitudes with each other or to give quick visualization of com- 
ponent parts of a given magnitude. Extensive use of vertical 
or horizontal bars is made by the United States Bureau of the 
Census in the Statistical Atlas of the United States, one of which 
was issued in 1914 and another in 1924. In addition, many 
modern writings, especially in the fields of the social sciences, 
attempt to portray by charts the statistics it is desired to present 
for popular reading. 


1 Cf. CROXTON, F. E., and HAROLD STEIN, “Graphic Comparisons by Bars,
Squares, Circles and Cubes,” Journal of the American Statistical Association,
vol. 27 (1932), pp. 54-60.

Figure 16 is a graphic portrayal of the budget expenditures
of the Federal government, based upon legislation in effect in
February, 1943, in which the blacked-out portion of the vertical
bars reveals in a striking manner the expected increases from
year to year in expenditures for war activities.

Fig. 16.—Budget expenditures of the Federal government, based upon legislation
as of February, 1943. (The Budget of the United States Government.)


The use of horizontal bars is illustrated in Fig. 17, which 
shows graphically the statistical data in Table 4. The difference
between the distribution of income among nonrelief families in
metropolitan areas and that among families on
farms is seen at a glance, and a slight scrutiny of the bars brings
out the less dramatic but clear differences in the distribution of
income in small cities compared with that in the larger ones. 

Fig. 17.—Income distributions of nonrelief families in six types of community,
1935-1936. (Based on Table 4.) [Axis label: per cent of families.]

Another government publication contains data, shown in
Table 6, from which charts were drawn that illustrate the use
of the component-part bar chart. The data that appear in
these tables are shown in component bars in Fig. 18, where
the length of the bar is varied in accordance with the income
level. This makes possible the visual comparison of the average
total family expenditure at various income levels. For example,
at the income level of $2,000 to $2,500 the aggregate family
expenditure averages a little over [amount illegible]. At the same time,
the amount spent for various purposes can be seen from the
differently crosshatched parts of each bar. Throughout the
bars one kind of crosshatching represents a specified kind of
expenditure.

Fig. 18.—Use of income by American families at different income levels,
illustrating the use of bar diagrams. (Based on Table 6.) Note: Taxes shown here
include only personal income taxes, poll taxes, and certain personal property taxes.

Fig. 19.—Percentage use of income by American families at different income
levels, 1935-1936, illustrating the use of 100 per cent bar diagrams. (Based
on Table 6.) Note: Taxes shown here include only personal income taxes, poll
taxes, and certain personal property taxes.

The second desirable comparison is still more
quickly grasped by the use of 100 per cent component part bars,
which is illustrated in Fig. 19. When such a chart is drawn,
it is always advisable to warn readers that 100 per cent bar
charts are being used; in addition, the table of actual figures
should be given, for the actual figures are completely concealed
in the relative figures if only the chart is given. It will be
noticed that clever arrangement of crosshatching, placing contrasting
types adjacent to each other, aids greatly in the reading
of the chart.

Fig. 20.—Variation in expenditures with income, illustrating the use of a
crosshatched zone diagram. [National Resources Committee, Consumer Expenditures
in the United States, Estimates 1935-1936 (1939), pp. 165-166.]

Figures 20 and 21 are interesting uses of the bar
chart, virtually in the form of zones, to show the distribution of
the consumer food dollar on the assumption of four different
total national income levels. The same data are shown in
Fig. 21 in the form of a 100 per cent bar or zone chart. The use
of the zone effect has the advantage of aiding the eye to make
the principal indicated comparisons.

Fig. 21.—Variation in percentages of various expenditures with income,
illustrating the use of a 100 per cent crosshatched zone diagram. [National Resources
Committee, Consumer Expenditures in the United States, Estimates 1935-1936
(1939), pp. 165-166.]

There are many examples of the use of sectors of circles in
the Statistical Atlas of the United States, census of 1920, and a
number in the publications of the census of 1930. Figure 22 is an
example of a single circle divided into sectors representing
component parts in the utilization of milk in the United States
in 1929. As in the case of the component bar charts, so also
in the case of sectors of a circle, it is possible to represent changes
from time to time in percentage components by the use of a
series of circles. It is not advisable to use the sectors and
circles as bars were used in Fig. 21, namely, to picture relative
change and total change simultaneously. To do this with
sectors and circles involves areal comparisons that are not
grasped by the readers of the charts. In Fig. 23, which is
presented to illustrate the use of sectors and circles, the attempt
has been made to show also such an areal comparison. While
most people would see at a glance that the circle for the End of
1938 is smaller than the circle for the End of 1930, presumably
to indicate that the total United States long-term investments
in foreign countries were smaller in 1938 than in 1930, few could
see from the areal comparison of the circles how much smaller.
Perhaps it is sufficient to have the smaller 1938 circle call attention
to the fact and then assume that the reader will be led
thereby to note the figures, which are shown in a separate table.
But the figure shown in each sector of the circles is a component
percentage and does not throw light on aggregate amount.

Fig. 22.—Utilization of milk in the United States, 1929, illustrating the use
[of sectors]. Based on value. (Fifteenth Census of the United States,
1930, Vol. 4, Agriculture.)

For the purpose of showing graphically the component parts 
of a total, the split-bar chart is a promising new device. Figure 
24 illustrates its use to show the distribution of the consumer 
food dollar. Comparison between consumer dollars of varying
purchasing power could be shown by differences in the over-all
lengths of the bars used.

Fig. 23.—The United States’ long-term investments in foreign countries, end
of 1930 and end of 1938, illustrating use of circles of different sizes. (Bureau of
Foreign and Domestic Commerce, “The Balance of International Payments of the
United States in 1938,” p. 49.)

Fig. 24.—Distribution of consumer food dollar, 1935, illustrating use of a
split-bar chart. [National Resources Committee, The Structure of the American
Economy, Part I (1939), p. 68.]

Simple bar charts and sectors of circles, it will be noted, do 
not involve areal comparisons to depict component parts; the 
bars and sectors are areas, it is true, but it is not the areas that 
are compared. The comparisons are between the varying lengths 
of the bars, because the bars are of uniform width. Bars of 
varying widths would complicate the comparisons and make 
them areal. Moreover, sectors of circles do not involve areal 
comparisons, because the comparisons visualized are the arcs
cut on the circle by the angles from the center of the circle. 
The visual comparison is therefore a linear one. Everyone is 
used to estimating the size of a piece of pie. 
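
The linear character of the sector comparison can be made concrete: the central angle of each sector, and hence the arc it cuts on the circle, is directly proportional to the share it represents. The Python sketch below is an illustration only; the function names sector_angles and arc_lengths are hypothetical, and the shares used are the food percentages of Table 3.

```python
import math


def sector_angles(shares):
    """Central angle in degrees for each share of the whole."""
    total = sum(shares)
    return [360.0 * s / total for s in shares]


def arc_lengths(shares, radius=1.0):
    """Arc cut on the circle by each sector; proportional to the share."""
    return [math.radians(angle) * radius for angle in sector_angles(shares)]


# Percentage shares of aggregate food disbursements (Table 3).
shares = [18.4, 31.5, 50.1]
for share, angle in zip(shares, sector_angles(shares)):
    print(f"{share:5.1f} per cent -> {angle:6.1f} degrees")
```

Doubling a share doubles both the angle and the arc, which is the sense in which the comparison of sectors is a linear one.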

Cartograms. As the name indicates, cartograms are maps. 
Generally, outline maps are used and various devices employed to 
picture varying characteristics of different parts of the country. 
All are familiar with the colored maps that show the mountainous 
sections in brown shaded off to the green of the lowlands, the 
light brown in between being the higher plateaus, but not mountains.
The same principle is used in a variety of ways to present
statistics regarding geographically classified characteristics 
of the country by the use of maps. These will be classified and 
briefly described and illustrated. 

Cartograms by Dots, or Points. Dots varying in size for 
different quantities are used in the first class of cartogram of
this type. Because of the necessity of making areal comparisons, 
that is, of using different-sized circles, this type of cartogram is 
not widely employed and is considered not a good method of 
presenting subjects in cartogram form. An example in Fig. 25, 
however, shows clearly the areas of geographical concentration in 
1935 of wage earners in manufacturing industries of the United 
States. An attempt is made to facilitate the areal comparisons 
involved in this cartogram by supplying a key, in the lower right 
corner of the map. 

In the second kind of cartogram of this type dots of uniform 
size are used, each dot indicating an aggregate specified. When 
dots of this sort are used, they can be counted to figure out the 
total. Sometimes a dot is quarter or half shaded to indicate a 
quarter or half the amount of a full dot. When thus used, the

In Fig. 26, this is illustrated by the attempt to picture the
volume of wholesale trade of the state of New York, compared 
with the rest of the country. The dots are so dense that it is 
hardly possible to count their number. While the general
picture of relative density is quickly visualized from such a map,
this purpose can be better served by the use of the point-dot 
map. Another objection to the dot of uniform size map for 
this particular purpose is that it may convey the impression 
that the concentration of wholesale trade is over the whole 
state of New York, whereas it is known to be concentrated in the
metropolitan area of New York City. This conception is
more clearly brought out by the device of using large
dots of varying size.

Fig. 32.—Distribution of rubber manufacturing in three leading states in 1937,
illustrating a dramatic use of a point-dot map. [Reproduced from Barker, P. W.,
and E. G. Holt, Rubber Industry of the United States, 1938-1939 (Bureau of Foreign
and Domestic Commerce), p. 20. Source: Rubber Red Book.]

The third type is the point-dot cartogram, in which each dot 
means a certain quantity, but the dots are so small that they 
cannot be conveniently counted. The significance lies in pre- 
senting the idea of relative density of dots. Figure 27 shows the 
concentration in the Southeast and the Northern Middle states 
of nonpar banks of the United States. 

Cartograms by Colors and Shades. Obviously, the same effect 
can be produced by the use of colors and shades as by the use of 
dots, but the former are expensive to reproduce in print and 
therefore are not extensively employed. The Statistical Atlas
of the eleventh and twelfth censuses of the United States 
contains numerous such cartograms. 

Cartograms by Crosshatching. Making comparisons relating to 
geographical location by crosshatching maps has increased in 
popularity during recent years. It is more effective than the 
method of dots and is cheaper than coloring and shading. 
Figure 28 makes it easy for the reader to visualize the variation 
in different parts of the United States in the proportion of 
mortgaged owner-operated farms paying rates of interest as
high as or higher than 6.5 per cent. Figure 29 shows at a glance
the variation from state to state in the percentage increase in 
nonagricultural employment from 1940 to 1943. 

Figure 30 is an interesting experiment in the combined use 
of a map and bar chart to show variation in the percentage 
increase in manufacturing employment in various metropolitan 
areas from 1940 to 1943. Figure 31 shows the use of a map and 
bars to depict flow of freight traffic in the United States. In 
Fig. 32 the geographical concentration of the rubber-manu- 
facturing industry in three states of the United States is 
dramatically emphasized by showing outline maps of only those 
three states. 


CHAPTER V 
STATISTICS—A STUDY OF VARIATION 


Ubiquitous Variability. Only in the abstract sense is there 
such a thing as a fixed quantity; in all cases, with reference both 
to physical and to psychic things, practical quantitative expres- 
sions are variables. However fixed the true quantity may be, 
no human measuring device is capable of giving the exact 
quantity; hence, all measurements obtained are approximations. 
In both physical sciences and social sciences, the raw materials 
amenable to the techniques of statistics are quantitatively 
expressed variations. The methods of analysis are likely to be 
complex when the scientist is faced with complex variability. 
This fact for the social sciences is recognized in the following 
quotation: “The social scientist is limited by the fact that he 
does not deal with rational material but with the rational and 
irrational conduct of man. The host of variables which this 
fact introduces multiplies the obstacles to his work and sets 
limits to the applicability of results."¹


USE OF SYMBOLS 


Simplification of the complex methods that need to be used in 
statistics is accomplished by the use of symbols. Because sym- 
bols are used for various purposes, beginners may have a natural 
psychological reaction unfavorable to the study of statistics. 
The uninitiated may be mystified and frightened away from the 
subject on account of the symbolic presentation. It is impor- 
tant, therefore, to realize that the symbols used in statistics are 
quite simple and that there are not very many of them. Fur- 
thermore, they are easily learned and remembered, as soon as 
their real purpose of simplification is understood. 

1 Fosdick, Raymond B., A Review for 1939—The Rockefeller Foundation,
pp. 41-42. This foundation contributes extensively to the support of
research in many scientific fields; for example, it contributes to such research
organizations as the Brookings Institution and the National Bureau of
Economic Research, discussed in Chap. III.

The Variable X. The study of variation is the meat and
bones of the craft. The variable X is not a new idea to anyone 
who has gone as far as a first course in algebra and who has on 
many occasions said, “Let X equal . . . .” Symbols enter into 
statistical analysis in only three ways: 

1. To represent variation in size with time; in such a case the 
data measuring the variable are designated “time series.” 

2. To represent variation in order of magnitude, from smallest 
to largest, or vice versa (if time is involved, it is disregarded, as 
the variable is rearranged or reclassified upon the basis of mag- 
nitude); in such a case the data measuring the variable are 
designated “frequency series.” 

3. To represent variation in quality or attribute (for example, 
occupation, geographical location, or race). 

In symbolic language, it is purely a matter of convention that
the variable may be referred to as X or as Y or as Z. In a given
problem, if the nomenclature of X is assigned to a given variable,
it is necessary to retain that symbol for that particular variable
throughout the problem. In the theory of statistics conventions
have arisen as to the use of symbols; for example, variables are 
commonly designated by the letters at the end of the alphabet, 
while constants or known figures are designated by the letters 
at the beginning of the alphabet. 

One convention widely followed is to use a bar over a letter
to designate the arithmetic mean, so that $\bar{X}_i$ (read “$X_i$ bar”) is
the symbol for the mean of a series of X's. Another group of
X's would be $X_j$ and their mean $\bar{X}_j$. The subscripts i and j,
respectively, symbolize subgroups. For example, all the X's
may refer to the I.Q.'s of college freshmen; $X_i$ refers to the I.Q.'s
of male freshmen; and $X_j$ refers to the I.Q.'s of female freshmen.
Accordingly, $\bar{X}_i$ symbolizes the mean I.Q. of male freshmen, and
$\bar{X}_j$ symbolizes the mean I.Q. of female freshmen. It is then
conventional to designate the mean of all the X's, both $X_i$ and
$X_j$, as $\bar{X}$ (called “X bar”).
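The bar notation can be tried on a small example. The sketch below uses invented I.Q. scores (they are not data from this book) and hypothetical variable names; it computes the two subgroup means and the mean of all the X's, and checks that the overall mean equals the subgroup means weighted by the number of items in each subgroup.

```python
from statistics import mean

# Hypothetical I.Q. scores, invented purely for illustration.
x_i = [102, 110, 95, 121]         # X_i: male freshmen
x_j = [108, 99, 117, 104, 112]    # X_j: female freshmen

x_bar_i = mean(x_i)               # mean of subgroup i
x_bar_j = mean(x_j)               # mean of subgroup j
x_bar = mean(x_i + x_j)           # mean of all the X's taken together

# The overall mean is the subgroup means weighted by subgroup size.
n_i, n_j = len(x_i), len(x_j)
weighted = (n_i * x_bar_i + n_j * x_bar_j) / (n_i + n_j)

print(x_bar_i, x_bar_j, round(x_bar, 2), round(weighted, 2))
```

Note that the overall mean is not the simple average of the two subgroup means unless the subgroups are of equal size.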

Another commonly used convention is to designate an estimated
figure by a letter followed by prime. According to this
convention if an estimate is made of the value of X (for example,
the coming crop yield of wheat based upon reports to the United
States Department of Agriculture), the estimate is symbolically
designated X’. Similarly, if an estimate of X (the price of
wheat, for example) is made from information on supply and
demand data, it is called X’. The small Greek letter sigma ($\sigma$)
is used to designate standard deviation.¹ A special estimate
of the standard deviation is symbolized by č. 

It is a common practice to use certain other Greek letters to
symbolize statistics. Accordingly, $\mu_1, \mu_2, \mu_3, \ldots, \mu_n$ (the Greek
letter mu) symbolize the series of statistics called “moments”
about a mean of a sample. The symbol $\pi$ refers to the constant
3.1416. The symbols $\nu_1, \nu_2, \ldots, \nu_n$ (the Greek letter nu) refer
to moments about an arbitrary origin.
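As a rough illustration of the two families of symbols, the sketch below assumes the usual definition of a moment (the average of the k-th powers of the deviations from the chosen origin, with divisor n); that definition, the sample values, and the arbitrary origin A = 4 are assumptions made for the example and are not taken from this passage.

```python
def moment_about(values, origin, k):
    """k-th moment of `values` about `origin` (divisor n, an assumed convention)."""
    return sum((v - origin) ** k for v in values) / len(values)

sample = [2, 4, 4, 5, 10]              # invented figures
x_bar = sum(sample) / len(sample)      # arithmetic mean = 5.0

# Moments about the mean (mu).
mu_1 = moment_about(sample, x_bar, 1)  # always zero
mu_2 = moment_about(sample, x_bar, 2)

# Moments about an arbitrary origin A (nu).
A = 4
nu_1 = moment_about(sample, A, 1)
nu_2 = moment_about(sample, A, 2)

# The second moment about the mean can be recovered from the nu's.
print(mu_2, nu_2 - nu_1 ** 2)
```

The last line shows why moments about a convenient arbitrary origin are useful in hand computation: the moments about the mean can be obtained from them afterward.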

While the use of symbols has become fairly well standardized 
in some respects along the conventional lines indicated, complete 
uniformity and consistent systematization are far from realized. 
Even the simple conventions above enumerated are not uni- 
versally followed. Nevertheless, the student will find it an 
advantage to have his attention directed to these trends in 
symbolic representation. 


TIME SERIES 


Conventional Use of X and T to Symbolize Passage of Time. A
convention in time-series analysis is that X is used to refer to the
passage of time. T is also used for this purpose. It happens
that the same symbol, X, is conventionally used in geometry, trigonometry,
and the like, to refer to the horizontal axis in a plane.
The unification of these two conventions results in the convention
in statistics that, in making graphs of statistical time series, the
X-axis (the horizontal axis) is used to represent the passage of
time. Thus, the passage of time may be indicated by a series of
X’s: $X_1, X_2, X_3, \ldots, X_n$, as shown in Fig. 33, where X refers
to years,

[Fig. 33. Time axis marked 1940, 1941, 1942, 1943, corresponding to $X_1, X_2, X_3, X_4$.]

or as shown in Fig. 34, where X refers to months.

[Fig. 34. Time axis with successive months marked $X_1, X_2, X_3, X_4$.]


1 For further discussion of the standard deviation, see Chap. VI. 
2 See Chaps. XIX-XXIV.

As indicated, $T_1, T_2, \ldots, T_n$ may also represent the passage
of time.

Lower-case letters x and t refer to deviation from the mean;
that is, $X - \bar{X} = x$; $T - \bar{T} = t$.

Where the Variable Fluctuates in Size with Time. When the 
statistician is dealing with a variable that fluctuates in size with 
the passage of time, he refers to this variable as Y. This is a 
convention; there is no logical reason for it except that he has 
already used the symbol X or T to refer to time and wants to 
have a different symbol for the variable being studied as it fluc- 
tuates through time. This situation is described in technical 
language by saying that the variable is a “function” of time, by 
which is meant merely that, as time passes, the variable fluctuates 
in magnitude, one way or another. The simple symbolic way of 
saying exactly the same thing (where X refers to time and Y 
refers to the variable) is 


Y = F(X) 


There is nothing mysterious to be read into this expression. It 
is merely a use, slightly different from the ordinary one, of the 
equality sign; and the whole expression means that Y is a func- 
tion of X, or the variable which is being studied is a function of 
time, meaning that it fluctuates with the passage of time. This 
may be illustrated by one or two examples, imaginary figures 
being used. 


TIME PASSES IN 1944

The unit that constitutes the variable is the price of sugar per pound in
the New York City market (average for the month of prevailing daily prices).

        X                   Y
X_1  January        Y_1  3 cents
X_2  February       Y_2  2 cents
X_3  March          Y_3  4.3 cents
X_4  April          Y_4  5 cents
X_5  May            Y_5  4 cents
X_6  June           Y_6  2.8 cents


Thus $X_1$ is the first unit of time (January), and $Y_1$ is the
measurement of the variable Y at that time according to the
designated unit of description; in other words, $Y_1$ is the price in
January. Similarly, $Y_2$ is the price in February ($X_2$, or the
second unit of time), and so on.

The unit of time may be the week, as where the unit that constitutes the
variable is the amount of rainfall in inches in New York City per week.

X               Y
First week      0.1 inch
Second week     4.0 inches
Third week      0.3 inch
Fourth week     0.7 inch

In this illustration, $X_1$ refers to the first week, $X_2$ to the second
week, etc., while $Y_1$ refers to the inches of rainfall in the first
week, $Y_2$ to the inches of rainfall in the second week, etc.


The unit of time may be the year, as where the unit that constitutes the 
variable is the net worth of a business enterprise on Jan. 1 of each year. 


X       Y
1936 $20,001.00 
1937 $28,546.00 
1938 $21,527.00 
1939 $20,250.00 
1940 $27,430.00 
1941 $35,240.00 


It is customary in geometry, trigonometry, etc., to let the 
vertical axis represent the Y variable; fluctuations in Y are shown 
by vertical distances. The unification of this custom with 
statistical presentation results in the convention that, when a 
graph is made of a variable that is a function of time, fluctuations 
in the Y variable are shown by vertical distances while time 
change is indicated along the X-axis, or horizontally. 

Figure 35, showing comparative changes in cash farm income, 
farm-mortgage debt, and value per acre of farm real estate for 
years 1910-1942, is an illustration of the graph of a time series. 

Careful Description of Units Involved. One or two matters 
concerning the units involved in time series should be noted. 
Sometimes the variable refers to an average value over a specified 
period of time; in the first illustration above, the average price 
of sugar per pound in New York City is an average over a period 
of a month. In other instances, the variable refers to a total for
a given period of time; in the second illustration above, the inches
of rainfall are given by totals per week. In still other problems,
the variable refers to a quantity at the beginning of a period of
time or at the end of a period of time; in the third illustration,
the net worth of a business enterprise on Jan. 1 of successive
years was used. In Fig. 35, cash farm income is in totals for
calendar years, each year’s total being expressed as a percentage
of the average 1910-1914 yearly income. Farm-mortgage debt
is in amounts as of Jan. 1 each year, expressed as a percentage
of the average 1910-1914 annual amounts. Value per acre of
farm real estate is in amounts as of Mar. 1 each year, expressed
as percentages of the average 1912-1914 annual amounts.

It is important in connection with the study of time series to
know exactly how the variable is being used. It is equally
important that exact indication of this should be given. Every
good statistician invariably indicates either in titles of tables or
in footnotes just what his variables mean. He should do this
no matter how expert a statistician he is and no matter how clear,
without such explanation, his work may seem to him.

Fig. 35.—Cash farm income, farm-mortgage debt, and value per acre of farm real
estate, index numbers, United States, 1910-1942.

Cumulative and Noncumulative Data. Another important
matter is the difference between cumulative and noncumulative
data in time series. The fundamental distinction between
cumulative and noncumulative data is really the difference
between data of “condition” and data of “change.” Cumulative
data are the data of change. It is possible to add the data
on weekly rainfall and thus obtain data on monthly rainfall
or yearly rainfall. Sales of a store by the week can be added to
get sales by the month or by the year. It is possible to cumulate
the number of births daily in order to get the total number of 
births per month or per year. Income and outgo figures are 
cumulative data. To those who have studied accounting, a 
convenient analogy is to the profit and loss statement—figures 
in the profit and loss statement in the main represent cumulative 
data. 

Noncumulative data are those describing a condition and are 
not subject to the additive treatment. The average price of 
sugar per week cannot be cumulated to obtain the average price 
of sugar per month or per year. It is necessary to resort to 
averaging. The daily figures on population cannot be added
in order to get the monthly population figures. A balance
of $3,000 in the bank in January and of $5,000 in March do not
give you a balance of $8,000 for the two months. These are
items of condition and cannot be added. In order to obtain
significant summary results in the case of noncumulative data
over several periods of time, it is necessary to average rather
than to add.

The method of averaging is applicable, not only to the noncumulative,
but to the cumulative type of data. It is significant
to speak of the average daily rainfall during a given month or
year, or the average weekly rainfall during a given month or
year, or the average weekly sales of a given year, etc.
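The distinction can be made concrete with the figures already used in this chapter. The following is a minimal sketch (variable names are hypothetical): the weekly rainfall figures, being data of change, may be either added or averaged, while the bank balances, being data of condition, may only be averaged.

```python
from statistics import mean

# Cumulative data (data of change): weekly rainfall in inches,
# taken from the rainfall illustration earlier in the chapter.
weekly_rainfall = [0.1, 4.0, 0.3, 0.7]
four_week_total = sum(weekly_rainfall)     # adding is meaningful
average_weekly = mean(weekly_rainfall)     # averaging is also meaningful

# Noncumulative data (data of condition): bank balances in dollars.
balances = {"January": 3000, "March": 5000}
average_balance = mean(list(balances.values()))   # averaging is meaningful
# sum(balances.values()) would give 8000, which describes no actual balance.

print(round(four_week_total, 1), round(average_weekly, 3), average_balance)
```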

Another way of referring to a time series is to describe it as the 
situation in which a variable is classified according to the time 
of its occurrence. The basis of classification is time; and the 
most logical arrangement of the data in question is that basis. 
As will be seen, the data of a time series may be reclassified, for 
certain purposes, upon a different basis, and when this is done 
they no longer constitute a time series. 

Charting Time Series. When a time series is graphed, the
X-axis is used to represent passage of time, while the Y-axis is
used to represent varying magnitudes. Thus, in Fig. 36, the
points plotted would represent a magnitude equal to 2 in 1940,
equal to 1 in 1941, and equal to 3 in 1942, rising to 14 in 1943.
It is conventional to represent time series by lines or curves
connecting the plotted points. In graphic phraseology these
lines may be drawn through the plotted points as polygons (e.g.,
Fig. 36), or the changes in direction may be curved.

Fig. 36.—Chart of a time series.

Two kinds of charts are in general use for the graphic presentation
of time series: (1) arithmetic charts and (2) ratio charts.


ARITHMETIC AND RATIO CHARTS 


Arithmetic Charts. The arithmetic chart pictures arithmetic
changes in magnitude. For illustration, in Fig. 37 is shown a
variable magnitude represented by the line AA’, increasing by 1
during each time interval. This produces a straight line. On
such a scale any variable increasing at a constant rate would give
a straight line; but any variable increasing at a constant relative
rate would produce an ever-steeper curve. This is illustrated by
BB’, which shows a magnitude doubling in each interval, that is,
increasing at a constant relative rate.
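The contrast between the two kinds of growth can be checked numerically. In the sketch below the starting value of 1 for both series is an assumption made for illustration (the figure's actual starting values are not given here); AA’ gains 1 in each interval and BB’ doubles in each interval.

```python
# Six successive time intervals; starting values are assumed, not from Fig. 37.
aa = [1 + t for t in range(6)]      # constant difference of 1
bb = [2 ** t for t in range(6)]     # doubling in each interval

def differences(series):
    """Successive arithmetic differences."""
    return [later - earlier for earlier, later in zip(series, series[1:])]

def ratios(series):
    """Successive ratios of each value to the one before it."""
    return [later / earlier for earlier, later in zip(series, series[1:])]

print(differences(aa))   # equal differences: straight line on an arithmetic chart
print(differences(bb))   # growing differences: ever-steeper curve
print(ratios(bb))        # equal ratios: constant relative rate
```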

The significant comparison in such a chart is always with zero, 
and hence the zero line should invariably be included in the chart. 
Leaving out the scale between zero and the point where the
curve reaches its lowest point will give a deceptive appearance
to the changes that occur. This is illustrated in Fig. 38, where
$P_2$ is really 1/4 larger than $P_1$ (see scale) but appears in the figure
to be twice as great because only part of the vertical scale is 
shown. 

An arithmetic chart may also be a graph of relative figures,
in which change from time to time relative to some base is
pictured. Such a graph is Fig. 35. In this kind of graph, the
base is usually called arbitrarily 100 per cent and the relative
changes above and below that base are graphed as percentages
of it. Figure 39 shows a magnitude at 105 in 1941 (5 per cent
above the base), at 95 in 1942, at 90 in 1943, and at 105 again in
1944. It is an extensive practice to convert time series into
relatives, using some particular point in time as the base; and
when such relative series or “indexes” (as they are sometimes
called) are charted, the chart assumes the form indicated in
Fig. 39. The point of departure for reading such a chart is the
100 per cent line, which should be emphasized; the zero point
does not have to be shown on such a chart. The relative chart
should not be confused with the case in which raw data are
already in the percentage form and the zero per cent may be
the significant point of departure rather than 100. Thus the
raw data may be percentages of population paying income taxes
in successive years. In such a case the zero line is important,
the raw data themselves being in percentage figures.

Fig. 39.—Chart of time series in relatives.

Fig. 40.—The percentage changes in the prices of 354 industrial stocks (1935-1939
= 100). (Survey of Current Business, Weekly Supplement, Apr. 29, 1943.)

Figure 40 is an illustration of a graph of a relative time series.
It shows the month-to-month variation, compared with the
average of monthly figures, 1935-1939, in the prices of 354
industrial stocks; thus the average 1935-1939 equals 100 per cent.
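Converting a series into relatives is a matter of dividing each figure by the base figure and multiplying by 100. The sketch below uses invented raw figures and an assumed base level of 40, chosen only so that the resulting relatives match the readings described for Fig. 39 (105, 95, 90, 105); the names are hypothetical.

```python
# Invented raw figures by year; the base level of 40 is an assumption.
raw = {1941: 42, 1942: 38, 1943: 36, 1944: 42}
base = 40

# Each relative expresses the year's figure as a percentage of the base.
relatives = {year: 100 * value / base for year, value in raw.items()}

for year, relative in relatives.items():
    print(year, relative)
```

When the base is an average over several periods, as in Fig. 40, the base figure is simply the mean of the raw figures for those periods.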

Ratio Charts. The second type of chart for graphing time
series is the ratio chart, which is designed to picture relative rate
of change. According to Wesley C. Mitchell, the idea of the
ratio chart was introduced by Jevons in 1863-1865.¹ But the
ratio chart did not come into general use until its advantages
were explained by Prof. Irving Fisher and James A. Field, in
1917.²

Fig. 41.—Chart of relative growth.


The great popularity in recent years of the ratio chart has 
been largely due to the fact that special graphing paper has 
been made for the purpose, the work of making such a chart being 
thus vastly simplified.

In the case of the arithmetic chart, equal rises on the chart
per unit of time represent a constant rate of increase; in the
case of the ratio chart, equal rises per unit of time represent
a constant relative rate of increase. This is illustrated by the
comparison of the left with the right scale in Fig. 41. This
figure is a simple illustration showing a magnitude changing
at the same relative rate, BB’, and a magnitude changing at
a constant rate, AA’, both plotted on a ratio scale. The BB’
magnitude doubles in each time period. The AA’ magnitude
increases in each time period by the constant difference of 4.

1 Mitchell, W. C., Business Cycles, p. 209.

2 Fisher, Irving, “The ‘Ratio’ Chart for Plotting Statistics,” Publications
of the American Statistical Association, Vol. 15 (June, 1917), pp. 577-601;
Field, James A., “Some Advantages of the Logarithmic Scale,”
Journal of Political Economy, Vol. 25 (October, 1917), pp. 805-841. Cited
from Mitchell, op. cit., p. 209.
Notice the scale of logarithms at the right, which corresponds
to the scale of natural numbers at the left. These logarithms
are to the base 2. Thus, $\log_2 64 = 6$ because $2^6 = 64$;
$\log_2 32 = 5$ because $2^5 = 32$; etc. It is evident, of course, that
while the scale at the left is in geometric progression the scale
at the right is in arithmetical progression. This is a characteristic
of ratio paper. Ratio charts have no zero line, and there
is no point of emphasis. The attention is directed to the shape
and fluctuations in the curve. In the case of the arithmetically
ruled chart, growth at a constant difference is a straight line— 
the greater the difference, the steeper the line—but it is still a 
straight line if the difference is constant. In the case of the 
ratio chart, growth at a constant relative rate is a straight line— 
the greater the constant relative rate, the steeper the line—but 
it is still straight. 
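The straight-line property can be verified by taking logarithms to the base 2, as in Fig. 41. In the sketch below the starting values (1 for the doubling series, 4 for the series with a constant difference of 4) are assumptions for illustration. Equal steps in the logarithms mean a straight line on the ratio chart; shrinking steps mean a curve that flattens.

```python
import math

periods = range(5)
bb = [2 ** t for t in periods]        # doubles each period: 1, 2, 4, 8, 16
aa = [4 + 4 * t for t in periods]     # constant difference of 4: 4, 8, 12, 16, 20

def log2_steps(series):
    """Rise per period as it would appear on a base-2 ratio scale."""
    logs = [math.log2(value) for value in series]
    return [later - earlier for earlier, later in zip(logs, logs[1:])]

bb_steps = log2_steps(bb)   # equal rises: straight line on ratio paper
aa_steps = log2_steps(aa)   # diminishing rises: curved path on ratio paper

print(bb_steps)
print([round(step, 3) for step in aa_steps])
```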

On arithmetical paper, changes in differences produce curves 
or irregular lines. On ratio paper, changes in relative rates of 
change produce curves or irregular lines. The vertical scale of 
the arithmetical chart is an arithmetic progression. The vertical 
scale of the ratio chart is in geometric progression; but the 
logarithms of the natural scale on a ratio chart are in arithmetical 
progression. For this reason, the ratio chart is often called the 
semilogarithmic chart. One method of plotting a ratio chart is to
find the logarithms of the raw data and then plot the logarithms 
on arithmetically ruled paper. The results are the same as if 
the natural data were plotted on a ratio scale. The labor of 
looking up logarithms is avoided by having the scale made into 
a logarithmic one, upon which the plotting of natural data will 
produce the same effect as if the logarithms were found and 
plotted. This is shown in a very simple case in Fig. 41, in 
which the scale in logarithms is at the right and the scale in raw 
data units is at the left. As already explained above, the line 
BB’ represents a variable that increases at a constant relative 
rate, while the line AA’ represents a variable that increases by a 
constant quantity. In Fig. 41 the vertical distance between each 
of the scale markings on the left represents just double the 
absolute amount of the same vertical distance immediately 
below it and just half the absolute amount of the same vertical
distance immediately above it. In this figure the variable that
doubles every year follows the straight line BB’ (it was a curve 
in Fig. 37). A variable that increases by the same aggregate 
amount each year and hence follows a straight line in Fig. 37 
would follow a curved path on a ratio chart, such as line AA’ of 
Fig. 41. 

Since the logarithm of the ratio between two quantities is equal 
to the difference between their logarithms, ratio paper can be 
easily “calibrated” by the use of a logarithmic scale. Thus, if 
equal vertical distances are taken to measure equal aggregate 
differences between logarithms, then these same vertical distances 
will represent equal relative distances (equal ratios) between the 
antilogarithms of the logarithmic scale. In Fig. 41, for example,
the unit vertical distance is taken to be a unit difference between
logarithms to the base 2, and the logarithmic scale on the right
reads 1, 2, 3, 4, etc. Since the antilogarithm of a number to
the base 2 is equal to 2 raised to the power of that number, the
antilogarithms of the logarithmic scale become 2, 4, 8, 16, etc. This is the
scale shown on the left. It is evident that while the scale on
the right is in arithmetic progression the scale on the left is in
geometric progression. Accordingly, if paper is ruled so as to
be in arithmetic progression with respect to some logarithmic
scale but is marked or calibrated in terms of the antilogarithms
of the logarithmic scale, any variable plotted on this paper in
accordance with the antilogarithmic scaling will indicate a constant
rate of growth or decline wherever it traces out a straight
line.

Most ratio paper is ruled in accordance with a logarithmic
scale to the base 10, since this is the base of common logarithms.
An example of this kind of “semilogarithmic paper” (as it is
often called because the vertical scale is logarithmic while the
horizontal scale is arithmetic) is shown in Fig. 42. The reason
common logarithms are to the base 10 is that numbers are 
arranged upon a decimal system and, by taking the base 10 for 
logarithms, the integral part of the logarithm (characteristic) is a 
mere record of the position of the decimal point in the original 
number. The number 10 raised to the zero power is 1, and so the 
logarithm of 1 is zero; the number 10 raised to the second power 
is 100, and so the logarithm of 100 is 2; the number 10 raised to 
the third power is 1,000, and so the logarithm of 1,000 is 3; and
so on, indefinitely. Likewise any number between 1 and 10 will
have a logarithm (to the base 10) whose characteristic is 0; any 
number between 10 and 100 will have a logarithm whose char- 
acteristic is 1; etc. The fractional part of a logarithm (its 
mantissa) is the same for all similar successions of similar digits. 
The fractional part of the logarithm to the base 10 for the number 
2 is the same as the fractional part of the logarithm for 20 or 
200 or 2,000, etc., namely, 0.3010; but the characteristic of the 
logarithm of 2 is 0, the characteristic of the logarithm of 20 is 1, 
the characteristic of the logarithm of 200 is 2, and so on. Thus, 
the entire logarithm of 2 is 0.3010; the entire logarithm of 20 is 
1.3010; the entire logarithm of 200 is 2.3010; etc. Hence, when 
the base of the logarithm is 10, logarithmic markings of -2, -1,
0, 1, 2, 3, etc., represent antilogarithmic markings of 0.01, 0.1,
1, 10, 100, 1,000, etc.
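The separation of a common logarithm into characteristic and mantissa can be reproduced with a few lines of code. This is only a sketch with a hypothetical function name; it takes the characteristic as the integral part (rounded down) of the logarithm and the mantissa as what remains.

```python
import math

def characteristic_and_mantissa(number):
    """Split the base-10 logarithm of a positive number into its two parts."""
    logarithm = math.log10(number)
    characteristic = math.floor(logarithm)   # records the position of the decimal point
    mantissa = logarithm - characteristic    # the same for 2, 20, 200, ...
    return characteristic, mantissa

for number in (2, 20, 200, 2000):
    characteristic, mantissa = characteristic_and_mantissa(number)
    print(number, characteristic, round(mantissa, 4))
```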

Semilogarithmic paper to the base 10 is usually ruled to
represent either one logarithmic unit and the fractional parts
thereof, corresponding to equal tenths on the antilogarithmic
scale (called “one-cycle paper”), or two logarithmic units and
the fractional parts of each, corresponding to equal tenths on
the antilogarithmic scale (called “two-cycle paper”), or three
logarithmic units and the fractional parts of each, corresponding
to equal tenths on the antilogarithmic scale (called “three-cycle
paper”). All three of these types of logarithmic rulings are
shown in the right part of Fig. 42. Since the logarithmic scale
is in arithmetic progression, these rulings would be the same for
any logarithm differing by one, two, or three units; they would
apply to logarithms running from -2 to 0, as well as from 0 to 2.
Thus the corresponding antilogarithmic scale can be selected
by the statistician in accordance with his needs. If his data run 
from 2 to 800, for example, he would select three-cycle semi- 
logarithmic paper and make his scale as indicated on the left 
of Fig. 42. If his data ran from 200 to 80,000, he would also 
select three-cycle semilogarithmic paper and make his scale from 
100 (at the bottom) to 100,000 at the top. If his data ran from 
0.2 to 8, he would choose two-cycle semilogarithmic paper and 
make his scale from 0.1 (at the bottom) to 10 (at the top). 

Figure 42 is an illustration of a three-cycle ratio scale for the
plotting of a time series by months for 6 years. The scale as
drawn reads from 1 to 1,000, but it could be made to read from
10 to 10,000, or from 100 to 100,000, etc. At the right of the
figure are shown the three most generally used types of ratio
scales, the three-cycle ratio scale, the two-cycle ratio scale, and
the one-cycle ratio scale. If the extreme fluctuations of a time
series are 60 and 3,000, it would be necessary to use three-cycle
paper; on the other hand, if the extreme fluctuations are 60 to
500, it would be necessary to use only two-cycle paper.

Fig. 42.—Three-cycle semilogarithmic paper.
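The choice of paper in the examples above follows a simple rule: the scale begins at the power of 10 at or below the smallest figure and ends at the power of 10 at or above the largest, and the number of cycles is the number of powers of 10 spanned. The sketch below (function name hypothetical) applies that rule to the ranges mentioned in the text.

```python
import math

def choose_cycles(lowest, highest):
    """Return (scale bottom, scale top, number of cycles) for base-10 ratio paper."""
    low_exponent = math.floor(math.log10(lowest))    # power of 10 at or below the lowest figure
    high_exponent = math.ceil(math.log10(highest))   # power of 10 at or above the highest figure
    return 10.0 ** low_exponent, 10.0 ** high_exponent, high_exponent - low_exponent

for lowest, highest in [(2, 800), (200, 80000), (0.2, 8), (60, 3000), (60, 500)]:
    bottom, top, cycles = choose_cycles(lowest, highest)
    print(f"data {lowest} to {highest}: scale {bottom:g} to {top:g}, {cycles}-cycle paper")
```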

Figures 43 and 44 are intended to illustrate the advantages and 
disadvantages of the ratio chart. Figure 43 shows the com- 
parative growth of some famous cities of the United States on an 
arithmetic scale, and Fig. 44 shows the same data plotted on a 
ratio scale. These data are also shown in Table 7. It will be
noticed that on an arithmetic scale it is not possible to bring the
New York City population growth curve into the picture. On
the ratio paper this is possible. Of course, on the arithmetically
ruled paper New York City population could be plotted on a
different scale; but then the arithmetic comparison between
New York City and the other cities would be lost, since the
height of the curve from the zero line is what counts in the
comparison on arithmetic paper.

Fig. 43.—Growth of certain cities in the United States (arithmetic scale).


The advantage of the ratio chart is threefold. (1) It makes 
possible a quick answer to the question as to whether a magnitude 
is changing its rate of growth. (2) It clearly pictures the relative
significance of fluctuations; for example, relative differences
of small magnitudes appear as important as the same
relative differences of large magnitudes. On an arithmetic chart
the latter would appear much larger. If an arithmetic chart of
almost any item of production in the United States, say from
1800 to 1940 by years, is constructed, the fluctuations in the
curve for the earlier period will be minute, while the fluctuations
in the curve for the later part will loom very large. In such
cases, the conclusion has therefore sometimes been reached that
instability is greater now than formerly. Plotting the same data
on ratio paper would in most cases show that the earlier fluctuations
were relatively as great as or greater than the modern
ones. (3) It facilitates comparisons between time series in order
to detect correlation between them.

Fig. 44.—Growth of certain cities in the United States (logarithmic, or ratio,
scale).

'The disadvantage of the ratio chart is that it is not possible 
to make magnitude comparisons. For illustration, if the 
attempt were made to compare the actual size of Trenton, N.J., 
and New York City in 1930, an entirely incorrect, impression 
would be ereated— Trenton would appear from the ratio chart 
to be about half as large as New York City in 1940 if vertical 
distance were assumed to be magnitude. When the ratio chart 
is used, such magnitude comparisons must be made by the use 
of the raw figures themselves, which should always be given in a 
table of figures along with the chart.

TABLE 7. POPULATION OF SPECIFIED CITIES IN THE UNITED STATES FROM
EARLIEST CENSUS TO 1940
(In thousands)

Census    Trenton,       [illegible],   [illegible]    New York
date      N.J.           N.H.                          City¹

1790      [illegible]    [illegible]    [illegible]           49.4
1800      [illegible]    [illegible]    [illegible]           79.2
1810      3.0            [illegible]    [illegible]          119.7
1820      3.9            [illegible]    [illegible]          152.1
1830      3.9            [illegible]    [illegible]          242.3
1840      4.0            [illegible]    [illegible]          391.1
1850      6.5            9.7            [illegible]          696.1
1860      17.2           9.3            1.9                1,174.8
1870      22.9           9.2            165.1              1,478.1
1880      30.0           9.7            30.5               1,911.7
1890      57.5           9.8            140.5              [illegible]
1900      73.3           10.6           102.6              [illegible]
1910      96.8           11.3           124.1              [illegible]
1920      119.8          13.6           191.6              [illegible]
1930      123.4          14.5           214.0              [illegible]
1940      [illegible]    14.8           228.8              [illegible]

Source: Sixteenth Census of the United States, 1940, Vol. I, Population, pp. 32 and 660.
¹ Refers to New York City and its boroughs as constituted in 1940.


FREQUENCY SERIES 


Definition of a Frequency Series. A convenient arrangement 
of any set of data is a classification according to magnitude, 
that is, from smallest to largest. In the case of a time series, 
time seems to be the most logical and workable basis of classifica- 
tion, because it seems reasonable to view things as they occur in 
time. There is a rationality about such a procedure. But 
another aspect of data, unrelated to time, may be important. 
For example, how many different prices of sugar during a given 
week differed from the average price for that week, and in what 
respect did they differ, or from how wide a range of fluctuations 
in price during the week were the respective average weekly 
prices calculated? This particular aspect would have no 
reference to time, except as a matter of definition of the unit 
involved (one would not take prices of the third week in March 
to study the average price in the first week of March). When 
the arrangement of data according to time of occurrence is not 
significant, it seems rational to classify the data in a series from
smallest to largest. When this is done, the resulting series of
data is called an “array.”
Following is an example of an array:¹


AN ARRAY OF 10 CHILDREN IN THIRD GRADE, BY AGE

        Age,
        Years
X₁      7¼
X₂      7½
X₃      7¾
X₄      8
X₅      8¼
X₆      8½
X₇      8¾
X₈      9
X₉      9¼
X₁₀     9½


Variable X arranged according to magnitude, where X = age
of children in third grade, X₁ = age of youngest child, etc.,
until X₁₀ is age of oldest child.

The situation may be one where there are a number of children 
of each age, for example: 


AN ARRAY OF 18 CHILDREN IN THIRD GRADE, BY AGE

        Age,
        Years
X₁      6¼
X₂      6½
X₃      6½
X₄      6¾
X₅      6¾
X₆      6¾
X₇      7
X₈      7
X₉      7
X₁₀     7
X₁₁     7¼
X₁₂     7¼
X₁₃     7¼
X₁₄     7½
X₁₅     7½
X₁₆     7½
X₁₇     7¾
X₁₈     8


¹ The particular magnitude here taken is “age.” Any other common
characteristic could be taken as the magnitude for comparison of the chil-

From the array, it is noticed that there is 1 child 6¼ years old,
and there are 2 children 6½ years old, 3 children 6¾ years old, etc.

Inasmuch as there are really only eight variations of the
variable X, some of which occur more than once, the above is
more conveniently summarized as follows:


NUMBER OF CHILDREN OF SPECIFIED AGE, AMONG 18 CHILDREN IN THIRD GRADE

        Age, Years, of Children    Number of Children
        in Third Grade             of Specified Age
        X                          F
X₁      6¼                         1        F₁
X₂      6½                         2        F₂
X₃      6¾                         3        F₃
X₄      7                          4        F₄
X₅      7¼                         3        F₅
X₆      7½                         3        F₆
X₇      7¾                         1        F₇
X₈      8                          1        F₈

                                   18 = N = ΣF


This is called a “frequency series” or a “frequency distribution”;
the variable is listed in a column in the form of an array,
and in a second column the frequencies of each variation are
set down. It is merely a condensed form of the array and is
particularly convenient, as may be readily imagined, when a
large number of cases is studied. It will be noticed that a new
symbol is introduced, but it is a very simple one and one that
readily suggests itself. F₁ refers to the number of times X₁
occurs, F₂ the number of times X₂ occurs, etc. F stands in
general for the frequency of occurrence of a variation; 18 is the
total number of cases and is therefore the sum of the F's, and
this is written ΣF. (F₁ + F₂ + F₃ + · · · + Fₙ = ΣF.) However,
a more general way to symbolize the total number of
cases is to use a large N. Either ΣF or N could be used, but
it is conventional in statistics to use N to represent ΣF. This Σ
is the capital Greek letter sigma, and it is always used in statistics
to designate “sum” or “total of.”
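
The condensation of an array into a frequency series may be sketched in a few lines. This is an illustration only, not a procedure given in the text: the helper `quarter_years` is a hypothetical convenience, and the 18 ages are entered as they appear to read in the example of third-grade children (quarter-year steps from 6¼ to 8).

```python
from collections import Counter
from fractions import Fraction

def quarter_years(whole, quarters=0):
    """Age in years as an exact fraction, e.g. quarter_years(6, 1) is 6 1/4."""
    return Fraction(whole) + Fraction(quarters, 4)

# The array: 18 ages in order of magnitude (X1 ... X18), as read from the example.
array = (
    [quarter_years(6, 1)]
    + [quarter_years(6, 2)] * 2
    + [quarter_years(6, 3)] * 3
    + [quarter_years(7)] * 4
    + [quarter_years(7, 1)] * 3
    + [quarter_years(7, 2)] * 3
    + [quarter_years(7, 3)]
    + [quarter_years(8)]
)

# Frequency series: each distinct value of X paired with its frequency F.
frequency_series = sorted(Counter(array).items())
N = sum(F for _, F in frequency_series)   # N is the sum of the F's

for X, F in frequency_series:
    print(float(X), F)
print("N =", N)
```

The eight rows printed correspond to the eight variations of X, and N equals the number of entries in the array.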

Nature of a Frequency Distribution and Illustration. The idea 
of the array and of the frequency distribution in its barest 


dren, for example, height or weight. The basis must be a common charac- 
teristic or attribute that is a variable magnitude capable of quantitative 
measurement.
simplicity has been illustrated. From the example, it is seen 
that the frequency distribution is merely the commonplace and 
rational arrangement of a set of data in order of magnitude. 
As indicated elsewhere, this form of arrangement discloses a
natural order that appears to persist in all things,¹ namely,
that in a large number of observations of a common characteristic
of a thing the following tendencies exist:

1. A large number of frequencies cluster about a central 
magnitude or average, which occurs most frequently. 

2. Small variations above and below this central magnitude 
are numerous. 

3. Large variations are much less frequent. 

4. Extreme variations are rare. 

Following is an example of a frequency distribution showing 
the number of cities of 100,000 or more population that have 
specified death rates from puerperal causes: 

TABLE 8. MATERNAL MORTALITY IN CITIES OF 100,000 OR MORE
POPULATION IN THE UNITED STATES, 1938

Death Rates
(Number per 1,000
Live Births)          Number of Cities
X                     F
 1-                    2
 2-                   16
 3-                   18
 4-                   20
 5-                   15
 6-                   10
 7-                    4
 8-                    6
 9-                    0
10-                    2

                      93


Source: Bureau of the Census, “Vital Statistics,” Special Reports, Vol. 9, No. 7 (Feb. 10,
1940), pp. 125-126.

The average maternity death rate for these 93 cities is 4.8 per 
1,000 live births. It will be noted that, instead of writing X₁,
X₂, X₃, . . . , for each variant of the variable X, the symbol X
is written at the head of the column, indicating that the column
consists of X₁, X₂, X₃, . . . , Xₙ. The symbol F is handled in a
similar manner. Furthermore, in this illustration, class intervals 


¹ See Chaps. VI and VII.

of 1 are used, which is signified by the dash after each of the
numbers in the X column. This is because fractional rates are 
given in the source and not merely rounded numbers. For 
example, the death rate from puerperal causes in 1938 in the 
city of Akron, Ohio, was 4.0; in the city of Albany, N.Y., it was 
3.1; in the city of Atlanta, Ga., it was 4.4. Since the death
rates are given to one decimal place, if class intervals were not 
used for the frequency table, it would require some hundred or 
more rows of figures to place the death rates in an array. The 
symbol for the class interval is i. In this case, i = 10 decimal
units, or 1. The average 4.8 was calculated by assuming that
cases in any class interval all had the value of the mid-point 
of the interval. 
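
The calculation of the average from the grouped table may be sketched as follows. This is a minimal illustration, not the authors' worksheet: the variable names are hypothetical, and the frequency of 4 for the class beginning at 7 is an assumption inferred from the total of 93 cities and from the later remark that only 4 cases lie in the interval containing 7.8.

```python
# Table 8: lower limit of each class interval and the number of cities in it.
lower_limits = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
frequencies = [2, 16, 18, 20, 15, 10, 4, 6, 0, 2]
class_width = 1  # size of the class interval

# Every case is assumed to have the value of the mid-point of its interval.
mid_points = [lower + class_width / 2 for lower in lower_limits]

N = sum(frequencies)
weighted_total = sum(m * f for m, f in zip(mid_points, frequencies))
mean_rate = weighted_total / N

print(N, round(mean_rate, 1))
```

Under these assumptions the result rounds to the 4.8 per 1,000 live births quoted in the text.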

Discrete and Continuous Frequency Series. A discrete fre- 
quency series is one in which the units of measurements are 
more or less fixed by the character of the data. The phenomena 
actually occur in such a manner that their variations in size 
proceed by distinct jumps or steps. The unit of measurement is 
fixed by this fact. An example of such a series is a frequency 
distribution of interest rates, in which the quoted variations 
in rates are likely to fluctuate by ⅛ or ¼ per cent jumps and
there are few if any intermediate variations. The variation 
in the range of the actual cases is consequently by distinct steps
of ⅛ or ¼ per cent. The variation throughout the range is not
by infinitesimal amounts. The very character of the data 
determines the unit of measurement and its degree of refinement. 
Where variation proceeds in this manner, by discrete steps of
considerable magnitude as compared with the whole range of 
variation, it is probably best not to use a class interval. If the 
number of different values of X that occur are too numerous for 
convenience, however, then the data may be grouped into class 
intervals. Great care should be employed in this case to see 
that the class intervals are chosen so that the possible values of 
X are placed in a balanced position throughout the intervals. 
For example, if values of X occur at 0, 2, 4, 6, 8, etc., then, if 
grouping is desired, a class interval of size 4 might be chosen 
running from 1 up to but not including 5, from 5 up to but not 
including 9, etc. These would balance the actual X values 

¹ For a more complete discussion of the class interval and calculation of
averages, see Chap. VII.

around the center of each interval. On the other hand, intervals
of 4, running from 0 up to but not including 4, from 4 up to but
not including 8, etc., would result in the actual X values occurring
at the lower limit and middle of each interval, causing an
upward bias if the cases are assumed to be concentrated at the
mid-points of the intervals, as is usual.¹ If the discrete data vary
by steps that are small in relation to the range of variation in the 
data (e.g., in steps of 1 cent over a range of $100), then the data 
might reasonably be treated as if they were continuous. 
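
The effect of the position of the class limits on discrete data can be verified directly. The sketch below is an illustration under stated assumptions: the four data values are hypothetical, one case at each of 2, 4, 6, and 8, and `grouped_mean` is an illustrative helper rather than a formula from the text.

```python
def grouped_mean(values, first_lower_limit, width):
    """Mean obtained when each case is replaced by the mid-point of its class interval."""
    total = 0.0
    for x in values:
        # Index of the interval [lower, lower + width) that contains x.
        k = (x - first_lower_limit) // width
        lower = first_lower_limit + k * width
        total += lower + width / 2
    return total / len(values)

# Hypothetical discrete data occurring only at even values.
values = [2, 4, 6, 8]
true_mean = sum(values) / len(values)

# Intervals 1 up to 5, 5 up to 9: the values straddle each mid-point.
balanced = grouped_mean(values, first_lower_limit=1, width=4)
# Intervals 0 up to 4, 4 up to 8, 8 up to 12: values fall at the lower limit and middle.
biased = grouped_mean(values, first_lower_limit=0, width=4)

print(true_mean, balanced, biased)
```

With the balanced limits the grouped mean agrees with the true mean of 5; with limits starting at 0 it comes out at 6, an upward bias of 1 for these particular values.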

A continuous series is one representing a phenomenon that 
varies by infinitesimal amounts. It may have the appearance 
on the statistical table of the same discreteness as the discrete 
series; but this is because the arbitrarily discrete character of 
the unit of measurement eclipses the actual continuous character 
of the data. In a continuous series the range of the interval 
is obtained by a process of testing and finding the one that 
appears best to smooth the data, following the general rules for 
determining the class interval discussed later.² Frequency
series of all growth phenomena are of the continuous type. 
For example, the frequency distributions of weights or heights 
of people of some specified age are continuous in character. 
In passing from one height to another, the individual must 
necessarily pass through every minute difference between; and 
accordingly in measuring the heights of individuals at the same 
age (or of mature people) the variants will be by minute or 
infinitesimal differences. The units of measurement, however, 
will make them appear discrete in character. 

Charts of Frequency Distributions. A frequency table is the 
presentation of a series of variable magnitudes, usually arranged 
from smallest to largest, in such a manner as to record the fre- 
quencies of the different magnitudes. For purposes of graphing 
it is conventional to use the x-axis for the variable magnitude 
and the y-axis for the frequencies. For illustration, in Fig. 45, 
the x-axis shows the variations of magnitude (death rates from 
puerperal causes in 1938) and the y-axis the frequencies (the 
number of cities of 100,000 or more population) of those death 
rates—so that the points appearing from the left to the right 
signify the following: 

¹ Cf. Chap. VII.

² See Chap. VII.
Death rates in 1938 from puerperal causes: 

2 cities have death rates between 1 and 2 per 1,000 live births. 
16 cities have death rates between 2 and 3 per 1,000 live births. 


6 cities have death rates between 8 and 9 per 1,000 live births. 
0 cities have death rates between 9 and 10 per 1,000 live births. 
2 cities have death rates between 10 and 11 per 1,000 live births. 


The points are plotted over the mid-points to indicate that
the frequencies cover the class interval and not merely the
rounded quantities shown on the scale. Accordingly, F₁, or 2,
is plotted directly over 1.5; F₄, or 20, is plotted directly over 4.5;
etc. It is easily seen from the figure that the peak of the frequencies
is in the interval containing the average. It can also
be seen that numerous small variations from the average occur,
but large variations from the average are few in number; that is,
the frequency polygon slopes rapidly downward on each side
of the average where the frequency is highest. Variations of
1 below average death rate (death rate of about 3.8) lie in the
class interval having 18 cases; variations of 1 above average
death rate (death rate of about 5.8) lie in the class interval having
15 cases. Variations of 3 below and above average are much
less frequent: only 2 cases are in the class interval containing
death rate 1.8, and only 4 cases are in the class interval containing
death rate 7.8.

Fig. 45. Death rates in 1938 from puerperal causes. (Vertical axis: number of cities.)

Instead of a polygon to trace the direction of frequencies, 
the practice of using bars to depict frequency distributions is 
often followed. Figures 46 to 48 are illustrations of such graphs 
of frequency distributions. It is possible also to fit a curve to
the points either by freehand or by mathematical means and
thus describe graphically the frequency distribution by a curve,
which is called a “frequency curve.”

Fig. 46. Distribution of 617 wholesale price items by percentage of price
change, 1926-1929. (National Resources Committee, The Structure of the American
Economy, Part I, pp. 128 and 131.)

In Figs. 46 to 48 it is interesting to compare the concentration 
of percentage changes in the three different periods, namely, 
1926-1929, when prices and economic activity were compara- 
tively stable; 1929-1932, when prices and economic activity 
were on the decline; and 1932-1937, when prices and economic 
activity were increasing. Figure 46, depicting the distribution 
of percentage price changes, 1926-1929, is quite symmetrical, 
and the slope on each side of the maximum frequency is rapid; 
the position of the mean (wholesale price index for all commodities,
or -4.7) is close to midway between the two extreme ranges
of the variable. In Figure 47, however, there is no such symmetry.
On the contrary, there is a piling up of cases in the
negative direction so that the slope to the left of the maximum
frequency is gradual while the slope to the right is parabolic; the
distribution appears to have a tail in the negative direction.
Figure 48, on the other hand, shows the opposite tendencies,
with the appearance of a tail extending in the positive direction.

¹ See Chap. VI.

Fig. 47. Distribution of 617 wholesale price items by percentage of price
change, 1929-1932. (National Resources Committee, The Structure of the American
Economy, Part I, pp. 128 and 131.)

Fig. 48. Distribution of 617 wholesale price items by percentage of price
change, 1932-1937. (National Resources Committee, The Structure of the American
Economy, Part I, pp. 128 and 131.)

Figures 49 and 50 illustrate the use of frequency curves in
chemical studies.

Figs. 49 and 50. Analysis of the semicarbazone, melting point 178-179°,
proved it to be derived from an aldehyde C15H20. The position of its absorption
maximum at 3250 A. and that of the free aldehyde (3150 A.) regenerated on
hydrolysis with phthalic anhydride are in excellent agreement with the positions
found for citrylidenecrotonaldehydes and their semicarbazones. [Barraclough,
E., J. W. Batty, I. M. Heilbron, and W. E. Jones, “Studies in the Polyene Series,
Part I,” Journal of the Chemical Society (London), October, 1939, p. 1551.]

Figures 51 and 52 are illustrations of the use of frequency
histograms in biochemical studies.

While the frequency distribution in Fig. 45 is in the form of a
polygon, those of Figs. 46 to 48 and 51 and 52 are shown by
outline bars. When a frequency distribution is drawn with bars,
the graph is called a “histogram.”
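
A histogram can be roughed out in plain text, one bar to a class interval. The following sketch is offered only as an illustration: the function `histogram_lines` is hypothetical, the data are the maternal death rates of Table 8, and the frequency of 4 for the class beginning at 7 is an assumption inferred from the table total of 93.

```python
def histogram_lines(lower_limits, frequencies, mark="#"):
    """One text bar per class interval; the bar length equals the frequency."""
    return [
        f"{lower:>2}- | {mark * f} ({f})"
        for lower, f in zip(lower_limits, frequencies)
    ]

lower_limits = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
frequencies = [2, 16, 18, 20, 15, 10, 4, 6, 0, 2]

lines = histogram_lines(lower_limits, frequencies)
print("\n".join(lines))
```

Each bar covers its whole class interval, which is the feature that distinguishes the histogram from the polygon, where a single point is plotted over each mid-point.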


Frequency Distribution Plotted on a Ratio Scale. At an earlier
point in this chapter (page 131) the effect of plotting a time series
on a ratio scale (semilogarithmic paper) was discussed. For
some purposes the use of similar paper for the plotting of a
frequency series is desirable. Figure 53 shows the effect of
plotting on a ratio scale the frequency distribution showing the
number of cities having specified death rates from puerperal
causes. The frequency distribution when plotted on the
arithmetic scale as shown in Fig. 45 appears to be unsymmetrical,
with a steep slope on the left side and a gradual slope
on the right side. The use of a ratio scale for the variable
magnitude (continuing the use of an arithmetic scale for the
frequencies), as illustrated in Fig. 53, has reduced this contrast to
such an extent that the slopes on either side are almost the same
and the frequency polygon appears to be almost symmetrical.

Figs. 51 and 52. Distribution of daily Ca excretion for groups of rats.
Figure 51 (absolute Ca excretion) shows results of 792 determinations of
urinary Ca (mg./100 g./24 hr.) under standard conditions. Figure 52
(differences between groups) shows differences (283 values) between
test- and arbitrarily selected control groups. In both cases the results correspond
with a normal distribution. [Truszkowski, R., J. Blauth-Opienska, and J.
Iwanowska, “Parathyroid Hormone,” The Biochemical Journal (London), Vol. 33
(1939), p. 1007.]

Fig. 53. Death rates in 1938 from puerperal causes, with death rates on a
ratio scale. (Cf. Fig. 45.)

An interesting application of logarithmic frequency-distribution
analysis has recently been made in entomology,¹ by
C. B. Williams, who says:


Mr. Yule shows that the frequency distribution of sentence length 
(i.e., number of words between successive full stops) is of the skew type 
and by comparing two different manuscripts . . . he is able to produce 
convincing mathematical evidence on the identity or otherwise of their 
authorship. . . . When I converted some of Yule’s tables into diagrams 
I was struck by their general resemblance to skew distributions with 
which I have recently been dealing in some entomological problems,
. . . which distributions, I found, became normal and symmetrical if
the logarithm of the number was taken as a basis for subdivision into 
groups instead of the number itself. 


Taking the logarithm of the number as a basis for subdivision 
into groups instead of the number itself accomplished the same 
end as the plotting of the original groups on a logarithmic or 
ratio scale. 
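
The equivalence can be made concrete with a short sketch. It is illustrative only: the data are hypothetical and are not taken from Williams or Yule, and the helpers `log_class` and `class_limits` are names assumed here. Cases are grouped by equal steps of the logarithm, so that every class covers the same ratio between its upper and lower limits and would therefore occupy the same length on a ratio scale.

```python
import math
from collections import Counter

def log_class(x, step):
    """Index of the class interval when log10(x) is the basis of grouping."""
    return math.floor(math.log10(x) / step)

def class_limits(k, step):
    """Limits of class k expressed in the original units."""
    return 10 ** (k * step), 10 ** ((k + 1) * step)

# Hypothetical skew data (for example, counts per sample).
data = [2, 3, 3, 4, 5, 6, 7, 9, 12, 15, 20, 30, 45, 70, 150]
step = 0.5  # each class spans half a unit of log10

distribution = sorted(Counter(log_class(x, step) for x in data).items())

for k, f in distribution:
    lower, upper = class_limits(k, step)
    print(f"{lower:7.2f} up to {upper:7.2f}: {f}")
```

In original units the classes widen rapidly, yet the frequencies rise and fall in a far more balanced manner than they would under equal arithmetic intervals.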


GROWTH CURVES 


Not all curves shaped like frequency polygons or curves are 
in truth graphs of frequency distributions. Some growth curves 
assume shapes very similar to frequency curves.² Figure 54
is an illustration of a growth curve, showing the increase in 
Chlorella vulgaris cultures over a period of hours. The two 
curves contrast the peak of growth for two different-sized inocu- 
lums; in both cases the rate of multiplication per cell varied 
inversely with the density of population, not only in the early 
stages of growth but throughout the growth period in each 
culture. 


BIVARIATE SERIES 


Bivariates are cross classifications of two variable characteristics
possessed in common by the objects being studied.
Graphs of bivariates are sometimes confused with frequency
distributions because in some cases their shape resembles the
frequency curve. Charts of bivariates, however, may assume
almost any shape, and the center of the distribution may have no
more importance than any other part of it. Good examples of
bivariate comparisons may be found among the great variety
of vital events when they are related to the different ages by
their frequency of occurrence.

¹ WILLIAMS, C. B., “A Note on the Statistical Analysis of Sentence-length
as a Criterion of Literary Style,” Biometrika, Vol. 31 (1940), Parts III, IV,
pp. 356-361.

² For other types of growth curves, see Chap. XX.

Table 9 and Fig. 55 present a set of such distributions. These
are clearly bivariate comparisons. The x scale in these charts
is the variation in age, from childhood to old age, representing a
heterogeneous scale with respect to many vital events, such as
susceptibility to certain types of disease, accident, etc. Difference
in age constitutes in itself an attribute introducing lack of
homogeneity where such a reference is made of it. With reference
to many types of diseases, man at very tender ages and at
old ages is a different being from man at middle life or in the
prime of youth. Such bivariates have no reference to central
tendencies; the matter of central tendencies is irrelevant.
What is sought is a picture of the association between the two
variables, and the very character of the data is such that there
can be no expectation of a piling up of frequencies about one
average or central tendency. Figure 55 is presented to show a
number of examples of bivariate charts. It is readily seen that
when the purpose is understood such charts are very useful as
a method of picturing vital statistics; but merely because the
shape of the two last examples resembles the frequency polygon
it does not follow that these are true frequency distributions.

Fig. 54. Growth curve showing the rate of increase in population in Chlorella
vulgaris cultures as a function of time. [Pratt, Robertson, “Influence of the
Size of the Inoculum on the Growth of Chlorella Vulgaris in Freshly Prepared Culture
Medium,” American Journal of Botany, Vol. 27 (January, 1940), p. 54.]

Fig. 55. Examples of bivariate comparisons.


TABLE 9. DEATH RATES PER 100,000 POPULATION IN THE UNITED STATES,
1929, FROM SPECIFIED CAUSES, BY AGE*

Age    Tuberculosis   Scarlet        Cerebral       Broncho-     Puerperal
                      fever          [illegible]    pneumonia    [illegible]

 0-         7.01        11.29            2.06         182.00
 5-         2.27         5.81            0.59           6.44
10-         3.13         1.74            0.47           2.85        0.15
15-        22.37         1.11            0.83           3.42        9.94
20-        56.33         0.94            1.50           4.33       23.01
25-        72.28         0.92            2.19           4.66       24.72
30-        80.34         0.79            4.24           6.70       22.25
35-        86.17         0.60            9.95           9.38       18.48
40-        95.66         0.19†          21.47          13.91        8.18
45-       101.08         0.18           45.22          16.47        0.99
50-       100.32         0.28           85.37          22.38
55-       105.27         0.09          170.99          29.77
60-       102.63         0.17          286.15          46.03
65-       114.62         0.23          506.14          77.05
70-    [illegible]    [illegible]      814.09         124.62
75-       110.39                     1,323.92         238.82
80-                                  2,015.65         445.22
85-                                  2,477.50         845.00
90-        76.09                     2,365.00       1,035.00
95-                                  4,320.00       1,920.00


* The rates in this table were calculated from data on total deaths by ages in the total
registration area of the United States in 1929 according to the Bureau of the Census (1932),
Thirteenth Annual Report on Mortality Statistics, 1929, pp. 196-197, 198-199, 202-203,
206-207, 210-211; and population of the United States by age groups as reported
in the Abstract of the Census (1930), p. 183. In 1929, the death registration area of continental
United States included 95.7 per cent of the total population.

† Rates in italics based on less than 10 deaths.


The odd shapes that may be assumed by bivariate charts 
are shown by these illustrations. They may be U shaped; they 
may be J shaped; they may be S shaped; and, of course, they
may be shaped like an ordinary frequency distribution, but when
they are this is a matter of coincidence, without significance.¹

Figure 56 is an illustration of a bivariate chart of data in the 
field of the natural sciences, which is shaped like a frequency 
curve and which even uses the word “frequency” in the title of 
one of its units, though it is not a frequency curve. It is a chart 
of a bivariate comparison: the amplitude in centimeters compared
with frequency in cycles per second.

Fig. 56. Another bivariate comparison. [Clark, A. L., and L. Katz, “Resonance
Method for Measuring the Ratio of the Specific Heats of a Gas, Cp/Cv,” Canadian
Journal of Research, Vol. 18 (February, 1940), p. 30.]


Figure 57 shows the relationship between inventories and 
shipments of all manufacturing industries in the United States 
and is a bivariate chart. The dotted line on the figure represents 
the average relationship of inventories to shipments based on 
the 2½-year period from 1939 through the second quarter of 1941.
Deviations from this relationship by the quarterly items were 
small during the base period, the expansion of inventories being 
generally in proportion to the expansion of shipments. In 
contrast, inventories increased phenomenally in relation to 
shipments during the latter half of 1941 and the first half of
1942. Protective buying replaced immediate production needs
as a motive for much of the inventory accumulation during this
second period, and stocks expanded far out of line with the indicated
requirements of production, assuming that the shipments
give an indication of requirements for production.

¹ Cf. also such a type of frequency distribution as that described by
Thomas V. Pearce, “An Unusual Frequency Distribution: the Term of
Abortion,” Biometrika, Vol. 22 (1930-1931), pp. 250-252.

Fig. 57. A third example of a bivariate comparison. [Source: Survey of Current
Business, Vol. 23 (1943), pp. 3-9.]


STATIC VARIATION AND DYNAMIC VARIATION 


In statistical analysis there are two general forms of variation. 
The static form of variation is that occurring at a given point 
in time or occurring in such a manner that time may be rationally 
regarded as irrelevant to the variation. Where the variations 
that occur are a function of time, however, the variation is 
dynamic and requires different methods of analysis. In the 
main, the methods of analysis of static variation center in the 
treatment of the frequency distribution, whereas the methods 
of analysis of dynamic variation call for a different application
of principles. The same fundamentals, however, are used in
the analysis of dynamic variation or time series as those used 
for the analysis of frequency distributions; only the applications 
differ. 

Rational Frequency Distributions. A rational frequency dis- 
tribution is one in which that arrangement of the data is suggested 
by the nature of the matter observed. Such a frequency dis- 
tribution is rational also because the variability of a common 
characteristic is chosen as the basis of the particular classification 
and this basis remains comparable among the objects measured. 
Frequently, the same idea is expressed by saying that the data 
are homogeneous; thus a rational frequency distribution means 
one in which the variable is homogeneous. 

Homogeneity may be defined as the condition prerequisite to 
comparability of data with respect to the attribute or factor 
being considered. The negative aspect of this condition is that
attributes not being considered are judged unimportant for the 
purposes of the study in hand. The positive aspect of this 
condition is that attributes or factors judged important for the 
purpose of the study are taken into consideration. 

For example, if the attribute height of human beings is being 
considered, color of eyes may be judged irrelevant and therefore 
is not considered. But, for a homogeneous study, age, sex, and 
perhaps race are attributes that must be considered because they 
are all correlated with the attribute height and cannot therefore 
be judged unimportant in studying height. Unimportant 
attributes (those ignored) have zero correlation with the attri- 
bute studied. Attributes correlated with the attribute studied 
must be taken into consideration in order to obtain homogeneous 
data. In the example of heights, homogeneity is obtained by 
classification, that is, by taking heights of a particular class in 
which the correlated attributes are constants. Thus, heights of 
mature Caucasian males may be taken as one homogeneous 
group; another homogeneous group would be heights of mature 
Caucasian females; another would be sixteen-year-old Caucasian 
males; etc. 

An important result of homogeneity is that no particular cause 
of bias or cumulated variation is present. On the contrary, the 
causes of variation consist of many minute mutually uncorrelated
(or independent) causes of variation that occur according to the
law of large numbers, in other words, in a random manner.¹

Irrational Frequency Distributions. By disregarding the element
of time present in a time series, whose natural arrangement
is according to time occurrence, the data may be reclassified and 
arranged in an array, or a frequency distribution. Such a 
rearrangement would conceal the natural time sequence originally 
present in time-series data when in their natural or rational order. 
This type of frequency distribution is irrational as a method of 
summarization. The multiple forces affecting variability in a 
time series are not usually operative at random or in a mutually 
independent manner. On the contrary, the causes of variation 
may and usually do form a cumulative series of mutually depend- 
ent variations. It is to be noted in passing that Figs. 46 to 48 
are not distributions of time series. In the data for each of these 
frequency distributions the attribute summarized is for a specified 
time and all the variables are for that specified time—thus time 
is held constant, and the variation shown in the histogram is 
uncorrelated with time. These are rational frequency dis- 
tributions. But as soon as the data are viewed in their dynamic 
aspect, that is to say, are correlated with time, the many biasing 
attributes or factors of time destroy homogeneity in the data.²
For example, with respect to the price of sugar taken as a time 
series, the supply at a subsequent period might tend to be larger 
as a result of the relatively high price existing at the earlier 
period; and as a consequence the previous high price is a cause 
of a later lower price. The existence of a price situation, moreover,
at a given time may produce technological changes in the pro- 
duction and distribution of sugar that in turn will be a dominant 
factor in the determination of a subsequent price. 
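
That such a rearrangement conceals the time sequence can be shown with two short series. The sketch is illustrative only: both series are hypothetical, containing the same eight values in different orders, and the helper names are assumptions made for the example.

```python
from collections import Counter

def frequency_distribution(series):
    """Frequency series of the values, ignoring the order in which they occurred."""
    return sorted(Counter(series).items())

def successive_changes(series):
    """Change from each period to the next, which depends on the time order."""
    return [later - earlier for earlier, later in zip(series, series[1:])]

# Hypothetical prices: the same eight values in two different time orders.
steady_rise = [10, 11, 11, 12, 13, 13, 13, 14]
erratic = [13, 10, 14, 11, 13, 12, 11, 13]

same_distribution = (
    frequency_distribution(steady_rise) == frequency_distribution(erratic)
)

print(same_distribution)
print(successive_changes(steady_rise))
print(successive_changes(erratic))
```

The two frequency distributions are identical, although one series never declines and the other swings up and down; whatever dependence exists between successive periods is lost once the data are arrayed by magnitude.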

In spite of the fact that the procedure of reclassifying time 
series and arranging the data in frequency distributions is 
irrational, it has legitimate uses in statistical analysis. There 
is a place for irrational procedure in the progressive development
of knowledge; but, when used, the user should be conscious of the
irrationality involved.³


1 For careful consideration of the law of large numbers, see Chaps. IX-XI;
also see J. G. Smith and A. J. Duncan, Sampling Statistics and Applications,
pp. 101-103, hereafter referred to by the short title Sampling Statistics.

2 For further discussion of time-series analysis see Chaps. XIX-XXV.

3 FISCHER, LUDWIG, The Structure of Thought, p. 360. Cf. for illustrations
of such rearrangement of time series, Dickson H. Leavens, “Frequency
Distributions Corresponding to Time Series,” Journal of the American
Statistical Association, Vol. 26 (1931), pp. 407-415.


VARIABLE QUALITIES, OR ATTRIBUTES


When statisticians have to deal with variable qualities, such as
different colors, different races, different climatic conditions,
different geographical locations, or different intellectual or moral
capacities, their problems are principally questions of the con-
sistent use of class or group distinctions. Usually there is no
need for elaborate quantitative treatment. Yet, so far as
possible, statisticians strive to convert quality, or attribute,
differences into quantitative terms, and when that is accom-
plished, their analysis is similar to the analysis of frequency
series. It has been found, for example, that certain tests can
be made to provide quantitative measures of differences in
intelligence, native or acquired; and a large scope for statistical
analysis lies in the field of education and psychology through
the use of these tests.


PART II 
Analysis of Frequency Distributions 


CHAPTER VI 
SUMMARIZATION AND COMPARISON 


For summarization and comparison of static variation the 
fundamental tool of analysis is the frequency distribution, its 
graphic presentation, and the analysis of its characteristics. The 
frequency distribution portrayed in a table or in a graph gives 
a picture of the whole of the variation relative to some 
particular matter; but how can comparisons be made? The 
frequency table, especially if large numbers of magnitudes are 
involved, even though it is admittedly better than a haphazard 
arrangement of the data, requires study before the mind can 
grasp its full significance. If two frequency distributions of 
heights (e.g., of mature males in New Jersey in 1800 and of 
mature males in New Jersey in 1900) are to be compared, the 
frequency table could be used, but the total number of cases 
measured might be different in each year taken, which would 
make it more difficult to discern the similarity or lack of similarity 
of the two distributions. To make comparisons, a chart could 
be drawn; but a chart may be large or small depending on the 
scale used, and differences would then appear from purely 
arbitrary, mechanical causes having no real significance. More- 
over, if the heights of these same males and also their weights 
are to be compared, a comparison of nonhomogeneous units 
(inches of height and pounds of weight) is required. Clearly 
some method or methods of summarization and comparison of 
frequency distributions must be devised. 

Use of Frequency Distributions. The common practice of 
attaining a summary figure by “averaging” is familiar to all, but 
it should be clear that an average, taken by itself, is indeed a 
very “summary” expression for a variable. It is one value, used
to represent a whole series of variations; and a study of the varia- 
tions about the average may be as important as or more important 
than the study of the average alone. In statistics and in most of 
the fields of study that use statistics and statistical methods, the 
average is generally a convenient point of departure for a more 
adequate analysis of the variable.¹

Types of Comparison. There are six possible ways in which it 
may be desirable to obtain summary figures and to make com- 
parisons. This may be explained by the use of diagrams, as 
follows: 

In Fig. 58, the central tendency, or average, is located at A,
which is plumb with the peak of the frequency curve. In this
figure the central tendency is typical, in the sense that it is a
magnitude that occurs more frequently than other magnitudes.
It may be looked upon quite rationally as a norm, or typical
value. In such a case, average value has a significance for
itself, as a summary value, but its principal use is still a
comparative one. For example, suppose that in Fig. 58 the
quantity variations (the x scale) are heights of children of a
specified age while the curve represents the number of children
having the indicated heights. The question is asked whether or
not a certain child is normal in height. If the child has less
height than height A, how much less must he be so that lack of
development in this respect indicates need for medical advice?
At once it is suggested that it is important to determine how
much on the average children vary from this normal height.
Accordingly, the principal use of the average as a summary
figure, when used as a norm, is to compare individual variations
with the average and to compare individual variations with the
average amount of variation to be expected.

Fig. 58. A central tendency as a norm. (Frequencies plotted
against quantity variations.)

The second type of comparison is the difference in central
tendencies existing between two distributions a and b, as
illustrated in Fig. 59. This difference is measured by comparing
the averages of the two distributions; for example, by the
comparison of the average height of children in third grade with
the average height of children in sixth grade. Such a comparison
is rational only where the units of the two frequency
distributions are comparable.


1 FISHER, R. A., Statistical Methods for Research Workers (1941), Section 1.
References to section numbers are valid for any edition of this book.


Fig. 59. Two different central tendencies. (Frequencies plotted
against quantity variations.)

Fig. 60. Similar central tendencies but different variability about
the central tendencies. (Frequencies plotted against quantity
variations.)


The third type of comparison is illustrated in Fig. 60, in 
which some sort of measurement of the variability of the varia- 
tions about the average is required for making the compari- 
son; for example, an average of the variations from the central 
tendency could be used. Such measures are called “measures of 
variability.” 


Fig. 61. Similar central tendencies, but different types of skewness
of distribution about central tendencies. (Frequencies plotted
against quantity variations.)

Figure 61 illustrates the fourth type of comparison. Frequency
curves a and b have peaks plumb with the same quantity, A, but a
is skewed to the left and b is skewed to the right. The central
tendency A is a value of greatest frequency in both curves, but
the lower range of curve a is farther from A than is the lower
range of b; and the upper range of a is much nearer to A than is
the upper range of b. School grades sometimes have a frequency
distribution like a, with the most common grade around 70 or 80
and with very few above 90, yet with some grades below 20 or even
as low as 10. Personal incomes are distributed like curve b, with
the most common income at an amount near to the lowest and a few
incomes at amounts far above the most common amount. When
frequency curves like a or b in Fig. 61 are encountered, it is
desirable to have some way to measure skewness and evaluate its
importance in connection with the interpretation of other
statistics about the frequency curve.

Fig. 62. Different central tendencies and different variabilities.
(Frequencies plotted against quantity variations.)

In the fifth type of comparison, illustrated by Fig. 62, not only
may it be desirable to compare average with average and variabil-
ity with variability in aggregate terms, but it may be essential also
to find a way to compare relative variability. The variability in
b relative to its average may not be so much larger than the
relative variability in a as the graph seems to show. The graph
shows that the absolute variability in b is greater; but it may
be that the relative comparison is the more significant one. To
make the relative comparison requires the calculation of further
information.

Curves a and b in Fig. 63, which illustrates the sixth type of
comparison, have the same central tendencies and approximately
the same average deviations about their central tendencies; but b
has a relatively greater concentration of small variations close
to the central tendency and also relatively more extreme
variations than does a. Another way of looking at this difference
is to note that the shoulders of a are broader than the shoulders
of b and that the top of a is flatter than the top of b. The
relative flatness of top or breadth of shoulder of a distribution
is called “kurtosis.” The measurement of this characteristic is
important in determining the relative importance of small
variations from average in the two curves.

Fig. 63. Similar central tendencies and similar variability, but
different concentrations at center and along the extremes.
(Frequencies plotted against quantity variations.)

It appears to follow from the above discussion of six types 
of comparison that the analysis of frequency distributions requires 
the calculation of the average and, in addition, the calculation 
of measures of dispersion.¹


THEORETICAL SIGNIFICANCE OF FREQUENCY CURVES 


Histograms and Frequency Curves. It has been noted that a 
frequency distribution may be graphed in the form of a histogram, 
that is, a figure in which the frequency of any class interval is 
represented by a rectangle erected on that interval as a base and 
with a height equal to the observed frequency.² If the data are
continuous in character, that is, if they change by very small 
jumps, it may become reasonable to represent the frequency 
distribution by a smooth frequency curve rather than by a broken 
histogram. 

Area Histograms. It is possible to make certain modifications 
in the form of the ordinary histogram to represent the frequency 
of cases occurring in any class interval, not by the height of the 
rectangle, but by the area of the rectangle. If the class interval 
is equal to unity, an area histogram is identical with one in which 
frequencies are represented by heights, since the altitude multi- 
plied by the base equals the area. But if the class interval is 
greater than unity, the height of an area histogram will be pro- 
portionately reduced; if the class interval is less than unity, the 
height will be proportionately increased. This follows because in 
an area histogram the frequency of any class interval is given by 
the height of the rectangle erected on it, multiplied by the length 
of its base (that is, by the size of the class interval). In histo- 
grams of the area type, it follows that the total area of the 
histogram always equals the total number of cases, N. 
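
The relation between height, base, and area described above can be
sketched in a few lines of Python. The frequencies and class-interval
widths below are invented for illustration and are not taken from
the text; `area_histogram_heights` is a hypothetical helper name.

```python
def area_histogram_heights(frequencies, widths):
    """Height of each rectangle so that height * width equals the frequency."""
    return [f / w for f, w in zip(frequencies, widths)]

# Hypothetical data: three classes with class intervals of 2, 1 and 0.5 units.
frequencies = [10, 6, 4]
widths = [2.0, 1.0, 0.5]

heights = area_histogram_heights(frequencies, widths)

# Summing height * base over all rectangles recovers the number of cases, N.
total_area = sum(h * w for h, w in zip(heights, widths))
print(heights, total_area)
```

A class interval greater than unity lowers the rectangle, one less
than unity raises it, and the total area stays equal to N.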

Relative Frequencies. The histogram may be further modified 
by making it represent relative or proportional frequencies, 
rather than absolute frequencies. Following is a table showing 
a proportional frequency distribution:³

1 See pp. 168-195. 

2 For illustrations, see Figs. 64 and 66, pp. 187-188. 

3 Cf. p. 141. 
MATERNAL MORTALITY IN CITIES OF 100,000 OR MORE POPULATION IN THE
UNITED STATES, 1938

                Number of      Relative number of cities
Death rates      cities       Expressed as    Expressed as
                              proportions     percentages
                   F              F/N         (F/N) × 100
    (1)           (2)             (3)             (4)
    1-              2            0.022             2.2
    2-             16            0.172            17.2
    3-             18            0.193            19.3
    4-             20            0.215            21.5
    5-             15            0.161            16.1
    6-             10            0.108            10.8
    7-              4            0.043             4.3
    8-              6            0.064             6.4
    9-              0            0.000             0.0
   10-              2            0.022             2.2
                   93            1.000           100.0


In the above table the figures in column (3) represent the pro-
portionate frequencies, namely, the proportionate number of
cities having the specified maternal mortality rates. Since this
illustration has a class interval of 1, an area histogram could be
obtained by plotting the frequencies of column (2) in the form of
vertical bars, with the heights equal to the respective frequencies.
A proportional area histogram could be obtained by similarly
plotting the frequencies shown in column (3); because in the
resulting histogram, the area of each rectangle would represent
the proportion of the total number of cases falling in a class
interval; it would represent F/N instead of F. The total area
of such a histogram will always equal unity, just as the total of
column (3) equals unity. This will be true no matter what the
form or shape of the histogram, because $\Sigma F = N$.
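
As a rough check, the proportions can be recomputed from column (2)
of the table. This is only a sketch; the variable names are
illustrative, and unrounded quotients may differ in the third
decimal from the printed column (3), which appears to have been
adjusted so that it totals exactly 1.000.

```python
# Column (2) of the table: number of cities in each death-rate class.
frequencies = [2, 16, 18, 20, 15, 10, 4, 6, 0, 2]
n = sum(frequencies)  # total number of cities, N

proportions = [f / n for f in frequencies]    # column (3), F/N
percentages = [100 * p for p in proportions]  # column (4), (F/N) x 100

# The class interval is 1, so each rectangle's area equals its height,
# and the total area of the proportional histogram is the sum of F/N.
class_interval = 1
total_area = sum(p * class_interval for p in proportions)
print(n, total_area)
```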

Frequency Curves. Suppose that the data, from which the 
histogram has been constructed, are a sample from a very large 
set of cases, theoretically an infinite set. For instance, the 
data might be the heights of 100 adult males of the white race, 
instead of the mortality statistics above illustrated. The 100
heights, then, would be a sample of the heights of all adult men
of that race, presumably millions of men. In such a relatively 
small sample, the size of the class interval cannot be reduced with- 
out causing the histogram to show very irregular fluctuations. 
If, however, many cases are added to the number in the sample, 
say heights of 200 men, the size of the class interval could be 
reduced, for example, from 10 units to 5 units, without causing 
the occurrence of such irregularities. In fact, if the number in the 
sample is made larger and larger and at the same time the size of 
the class interval is continuously reduced, the histogram will 
tend to become more and more regular and the tops of the rec- 
tangles, which are getting narrower and narrower, will come 
closer and closer to forming a smooth continuous curve (a fre- 
quency curve). In such a manner the frequency curve may be 
viewed as the limit that an area histogram of relative frequencies 
approaches as the number of cases is increased and the size of the 
class interval is reduced indefinitely. The frequency curve is the 
distribution of a theoretically infinite set of data, with a theoreti- 
cally infinitesimal class interval. 

Being the limit approached by an area histogram of relative 
frequencies, the frequency curve has a total area (between the 
curve and the x-axis) that is always equal to unity. Furthermore,
any section of area under the curve¹ will give the relative
frequency of the cases falling within the class interval marking
off that section of area. It is upon this basis that tables of
relative frequencies are constructed for certain well-known
frequency curves.²
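
To make the idea of "area as relative frequency" concrete, the
sketch below assumes one particular frequency curve, the standard
normal curve, which this passage does not itself specify. The
function names are hypothetical; `math.erf` from the Python standard
library supplies the area.

```python
import math

def normal_cdf(z):
    """Area under the standard normal curve to the left of z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def relative_frequency(lower, upper):
    """Relative frequency of cases falling between lower and upper."""
    return normal_cdf(upper) - normal_cdf(lower)

# Section of area within one unit either side of the center.
within_one = relative_frequency(-1.0, 1.0)

# Taking a very wide interval recovers (numerically) the whole area.
whole_area = relative_frequency(-40.0, 40.0)
print(round(within_one, 4), whole_area)
```

Printed tables of relative frequencies for a curve are, in effect,
tabulations of such areas for a range of interval boundaries.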

Uses of Frequency Curves. Frequency curves are hypothetical, 
but they are idealizations of frequency distributions that are real. 
They serve many useful purposes and in the theory of statistics 
they are indispensable. One important use of frequency curves 
is the graduation of frequency distributions obtained from actual 
observation. Suppose, for example, that a frequency distribu- 
tion has been constructed, using a class interval of 10 units. 
Suppose further that the number of cases is such that any smaller 
class interval would introduce marked irregularities into the 
distribution, irregularities that, it is believed, would not be present
if an infinitely large number of cases were observed. In this
case a frequency curve fitted to the distribution (histogram)
may be the best means of estimating the true frequency for any
given class interval. In other words, the frequency curve affords
a graduation for the frequency distribution. The frequency
curve makes it possible to interpolate values not given directly
by the original sample frequency distribution.


1 See Fig. 91, p. 264 and Fig. 94, p. 277.
2 See Appendix, Table VI.

Besides serving to graduate a given set of data, frequency 
curves facilitate in other ways the description and comparison of 
frequency distributions. For instance, the peakedness or flatness
of a particular frequency curve, called the “normal frequency
curve,” is taken as the standard to which the peakedness
or flatness of a given distribution is generally referred. Again, 
theoretical analysis shows that data affected by certain kinds of 
forces will tend to be distributed in the form of particular types 
of frequency curves. Certain types of curves, therefore, become 
the expected norm for all data affected by particular kinds of 
forces. As a consequence, the hypothesis that variations in a 
given set of data have resulted from certain forces may be tested 
by noting how well the distribution of the data conforms to the 
type of frequency curve that these forces may be expected to 
produce. In such instances frequency curves help to explain the 
underlying causes of variation. Such an analysis is of special 
importance when it is assumed, as it is in so many statistical 
procedures, that chance is the fundamental cause of variation. 
It is to be noted that a difference in the general form of two 
frequency distributions may in some cases be looked upon as of
more fundamental importance than a mere difference in their 
averages, dispersion, and the like, because such a difference in 
form may indicate a contrast in the type of forces causing varia- 
tion in the data. To detect a fundamental difference of this 
kind frequency curves are used. 

Still another useful purpose served by frequency curves is in 
sampling analysis. Since a chapter is subsequently devoted to a 
discussion of sampling, it need merely be touched upon and 
simply illustrated in very general terms at this point.¹


1 See Chap. XII; see also SMITH, J. G., and A. J. Duncan, Sampling
Statistics, pp. 107-109, Parts II and III.

For illustration, suppose that a large number of balls, each with a
number written on it, are placed in a big bowl and thoroughly 
mixed. Suppose that 10 balls are drawn at random from the 
bowl and their numbers read off and averaged. Suppose that 
this sampling operation is repeated over and over again, the 
balls being replaced and thoroughly mixed after each set of 
drawings. Experience shows that unless the distribution of num- 
bers is freakish, the distribution of sample averages will approxi- 
mate the so-called “normal” frequency curve. If, instead of 
the average of the respective 10 readings, a certain measure of 
the variation around their averages, known as the “variance,” 
had been recorded in each instance, then the frequency distribu- 
tion of these measurements of variation would have tended to 
conform to a frequency curve known as the “χ² curve.”¹ The
significant thing is that “sampling distributions” of this kind tend
to conform to specific frequency curves that may be described 
by definite mathematical formulas. In general, these formulas 
are expressed in terms of the characteristics of the “population” 
(in the illustration, the bowl of numbers) from which the sam- 
ples are drawn. The consequence is that if a random sample 
has been obtained from an unknown population, it is possible 
from knowledge of the sampling distributions of various sample 
measurements to make certain inferences regarding the nature 
of the population from which the sample has been drawn. This 
is probably the most important use that is made of frequency 
curves in statistical analysis. 
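
The bowl experiment can be imitated by simulation. The sketch below
is only illustrative: the bowl of 1,000 balls numbered 1 to 1,000,
the 5,000 repetitions, and the fixed seed are all assumptions, not
details from the text.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical bowl: balls numbered 1..1000 (a flat, non-normal population).
bowl = list(range(1, 1001))

def sample_mean(population, size=10):
    """Draw `size` balls at random and average their numbers.

    The balls are, in effect, returned to the bowl before the next set
    of drawings, because each call samples from the full population.
    """
    return statistics.mean(random.sample(population, size))

sample_means = [sample_mean(bowl) for _ in range(5000)]

population_mean = statistics.mean(bowl)
center = statistics.mean(sample_means)
spread = statistics.pstdev(sample_means)

# Share of sample averages within one `spread` of the center; for a
# normal curve this share is roughly two-thirds.
share = sum(abs(m - center) <= spread for m in sample_means) / len(sample_means)
print(population_mean, round(center, 1), round(spread, 1), round(share, 2))
```

Although the numbers in the bowl are spread evenly, the sample
averages pile up around the population average, which is the
behavior the passage describes.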


MEASUREMENTS OF SUMMARIZATION AND COMPARISON 


Population, Parameters, and Statistics. Population. To say 
that the population of the United States is one hundred and 
thirty million people is a familiar use of the word “population.” 
In statistics the word is used in the same familiar sense, but it is 
also used in a more general sense, referring to the count of 
persons or of animals of any kind or even to the count of inani- 
mate things. To statisticians the term means all the things, 
animate or inanimate as the case may be, of a given kind in 
the known universe or in a specified universe, for example, all the 
people on the earth, or all the people in the United States if the
universe is more specific. An example of an inanimate popula-
tion would be all the petroleum in the known universe or, if a
more specific universe is considered, all the petroleum in the
United States.


1 This is read “chi square”; the letter is the Greek small chi.

Parameters and Statistics. In the theory of statistics the 
measurements of the characteristics of the population are called 
“parameters.” The average height of all people living in the 
United States is a parameter of the population. No one has 
ever actually measured the heights of all the people living in the 
United States, and it is not likely that anyone ever will do so. 
Nevertheless, this population does exist. In practice, it is 
much easier to estimate the average height of all the people by 
taking the average of a sample of the people. This latter 
average, the average of the sample, is called a “statistic.” 
Accordingly, parameters are measures of the characteristics of the 
population, and the corresponding sample measures are statistics 
commonly used to estimate these parameters. A statistic is thus 
a value computed from an observed sample in order to characterize
the population from which it is drawn. Parameters are
the characters of the population.¹

In accordance with this terminology, the quantities to be 
obtained as measures of central tendencies are “statistics,” the 
arithmetic mean is a “statistic,” the range (difference between 
the highest and lowest magnitude) of a frequency distribution is 
a “statistic.” 

Averages. There are several kinds of averages. The one 
most familiar is the arithmetic mean. The others most gen- 
erally presented are the median, mode, geometric mean, and 
harmonic mean. The most commonly used averages are the
mean, the median, and the mode. In this chapter each will be
viewed in its simplest aspect, and at the same time the symbolic
language associated with the analysis of frequency distributions 
will be introduced. 

The Arithmetic Mean. By definition, the arithmetic mean 
is the sum of the cases divided by the number of cases. For exam- 
ple, taking a simple case of ungrouped data, i.e., where the 
frequencies are 1 throughout (each X occurs once), $F_1, F_2, \ldots,
F_n$ each = 1:


1 FISHER, op. cit., pp. 7-8, 41.

The sum of the variable magnitudes in this case is 42. The
number of variable magnitudes is 7. Hence, by definition, the
arithmetic mean is 42/7, or 6.

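Symbolically,

$$\frac{F_1X_1 + F_2X_2 + \cdots + F_nX_n}{F_1 + F_2 + \cdots + F_n} = \frac{\Sigma FX}{\Sigma F}$$

The arithmetic mean is represented by the symbol $\bar{X}$, and
hence

$$\bar{X} = \frac{\Sigma FX}{N} \qquad (1)$$
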
In frequency distributions the F is not equal to 1 throughout 
but varies. An illustration of the calculation of the arithmetic 
mean of a frequency distribution is shown below: 


X      F      FX
2      3       6
3      3       9
4      6      24
5      9      45
6      6      36
7      3      21
$\Sigma F$ or $N = 30$      $\Sigma FX = 141$


It should be noted that the sum of the X's cannot be obtained
by adding the first column because the various X's occur 3, 6, or
9 times. Consequently, the sum of the X's is obtained by mul-
tiplying each X by its respective frequency and then adding
the products.

$$\Sigma FX = 141, \qquad \Sigma F \text{ or } N = 30$$

and therefore, by definition,

$$\bar{X} = \frac{141}{30} = 4.7$$

If the frequencies of a frequency distribution are expressed in
relative numbers, i.e., if each frequency is expressed relative to
the total number of cases, $F_1/N, F_2/N, \ldots, F_n/N$, the arith-
metic mean is merely the sum of the third column, as follows:

X      F/N      (F/N)X
2      0.1       0.2
3      0.1       0.3
4      0.2       0.8
5      0.3       1.5
6      0.2       1.2
7      0.1       0.7
       1.0       4.7

Following the definition, the arithmetic mean is the sum of the
third column divided by the sum of the second column; but the
sum of the second column is 1, by definition. Consequently,
the arithmetic mean is the sum of the third column.

$$\bar{X} = \Sigma \frac{F}{N} X \qquad (1a)$$

This modified form of the definition of the arithmetic mean is very
convenient in certain statistical problems.
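
Both forms of the calculation can be checked with a short Python
sketch that uses the frequency distribution tabulated above. The
variable names are illustrative only.

```python
# Frequency distribution from the worked example: magnitudes X and frequencies F.
xs = [2, 3, 4, 5, 6, 7]
fs = [3, 3, 6, 9, 6, 3]

n = sum(fs)                                  # sum of F, i.e. N
sum_fx = sum(f * x for f, x in zip(fs, xs))  # sum of F times X

# Eq. (1): the mean is the sum of FX divided by N.
mean_from_counts = sum_fx / n

# Eq. (1a): with relative frequencies F/N, the mean is the sum of (F/N)X.
relative = [f / n for f in fs]
mean_from_relative = sum(r * x for r, x in zip(relative, xs))

print(n, sum_fx, mean_from_counts, round(mean_from_relative, 10))
```

The two results agree apart from floating-point rounding, since
dividing each frequency by N beforehand is the same as dividing the
total by N afterward.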
The sum of the deviations of the cases from the arithmetic 
mean is equal to zero. This may be demonstrated as follows: 
Given the variable $X_1, X_2, \ldots, X_n$, with $\bar{X} = \Sigma FX/N$:

$$F_1(X_1 - \bar{X}) = F_1x_1$$
$$F_2(X_2 - \bar{X}) = F_2x_2$$
$$\cdots$$
$$F_n(X_n - \bar{X}) = F_nx_n$$
$$\Sigma FX - N\bar{X} = \Sigma Fx \quad \text{(by adding)}$$

The small x is used regularly to refer to the deviations of the
variable from the arithmetic mean.

When added, $\Sigma F\bar{X}$ becomes $N\bar{X}$ because $\bar{X}$ is
constant and $\Sigma F$ is equal to N.

If the value of $\bar{X}$ given in Eq. (1) is substituted in

$$\Sigma FX - N\bar{X} = \Sigma Fx$$

it becomes

$$\Sigma FX - N\,\frac{\Sigma FX}{N} = \Sigma Fx$$

By canceling the N, this becomes

$$\Sigma FX - \Sigma FX = \Sigma Fx$$

and hence

$$\Sigma Fx = 0 \qquad (2)$$
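
The result in Eq. (2) can be verified numerically. The sketch below
reuses the frequency distribution from the arithmetic-mean example
and works in exact fractions so that no rounding error obscures the
zero; the variable names are illustrative.

```python
from fractions import Fraction

# Frequency distribution from the arithmetic-mean example.
xs = [2, 3, 4, 5, 6, 7]
fs = [3, 3, 6, 9, 6, 3]

n = sum(fs)
mean = Fraction(sum(f * x for f, x in zip(fs, xs)), n)  # exact value of the mean

# Deviations x = X - mean, each weighted by its frequency F.
deviations = [x - mean for x in xs]
weighted_sum = sum(f * d for f, d in zip(fs, deviations))

print(mean, weighted_sum)
```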

The Median and the Quartiles. In its original simplicity and 
by definition the median is not a mathematical concept like the 
arithmetic mean. On the contrary, the median is a position 
average. By definition, the median is that value than which there 
is an equal number of cases larger and smaller. When the cases 
are arranged in an array, the median is either the value of the 
middle one (when there is an odd number) or some value between 
the two middle ones (when there is an even number). Nor- 
mally, in the latter instance, the arithmetic mean of the two 
middle cases is taken as the median value. To illustrate from a 
very simple example with an odd number of cases:


em Ae o to — ba 


$X_4$, or 6, is the median by definition, because it is the middle
one in the array. Mi = 6 (Mi is the conventional symbol for
median).

It is to be noted that $\bar{X}$ = 5.143.

In this illustration it is seen that 1, the first case, is 5 smaller 
than the median, while 9, the last case, is only 3 larger than the 
median. This preponderance of smallness of the variable 
results in an arithmetic mean smaller than the median. By 
definition, the arithmetic mean is affected by every variation and 
consequently by extreme variations. It is affected by the size 
and the number of cases above and below it. The median,
on the other hand, by definition, is not affected by the size of 
the cases above or below it. 

When the frequencies vary, the median may be found as in 
the following illustration: 


X      F      Cumulated F
1      2           2
2      4           6
3      5          11
4      8          19
5      7          26
6      3          29
7      2          31
      31


Thus, there are 2 cases where X = 1; there are 4 cases where
X = 2; etc. In all, there are 31 cases ($\Sigma F = 31 = N$), and the
middle one is the sixteenth. Mi = 4. That the median is 4 is
quickly seen by the examination of the cumulated frequencies in
the third column. This is equivalent to taking the median equal
to the (N + 1)/2th case, a procedure often adopted when dealing
with ungrouped data.¹
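
The lookup in the cumulated-frequency column can be written out as
a short Python sketch. It assumes, as in the illustration, that N is
odd so that there is a single middle case; the variable names are
illustrative.

```python
from itertools import accumulate

# Frequency distribution from the illustration above.
xs = [1, 2, 3, 4, 5, 6, 7]
fs = [2, 4, 5, 8, 7, 3, 2]

cumulated = list(accumulate(fs))  # running totals of F
n = cumulated[-1]                 # N, the total number of cases
middle_case = (n + 1) // 2        # the (N + 1)/2th case; N is odd here

# The median is the first X whose cumulated frequency reaches the middle case.
median = next(x for x, c in zip(xs, cumulated) if c >= middle_case)
print(cumulated, middle_case, median)
```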

In general terms, the first quartile $Q_1$ is that value below which
one-fourth of the cases fall and above which three-fourths of the
cases fall. Similarly, the third quartile $Q_3$ is that value below
which three-quarters of the cases fall and above which one-fourth
of the cases fall. The median is the second quartile, or $Q_2$. In
the above frequency distribution, N/4 = 7¾ and 3N/4 = 23¼.
$Q_1$ should thus be some value below which 7¾ cases fall and above
which 23¼ cases fall. For this distribution, it happens that the
seventh and eighth cases are identical, and it therefore follows
that the value of $Q_1$ is the common value of the seventh and eighth
cases, or $Q_1$ = 3. If the seventh and eighth cases had not been
the same, then $Q_1$ could be taken as a value between the values
of the seventh and the eighth case to be found by interpolation.

For ungrouped data, it is recommended that a uniform and
systematic method of interpolation be adopted, as follows:²
Take $Q_1$ as the (N/4 + ½) case, $Q_2$, or Mi, as the (N/2 + ½)
case, and $Q_3$ as the (3N/4 + ½) case. Consider, for example, the
cases 5, 9, 11, 16, 25, 31, 38, 43, 45, and 49. The (N/4 + ½) case
would be the (10/4 + ½) or third case; i.e., $Q_1$ = 11. The median
would be the (5 + ½) or 5½th case. Since there is no 5½th case,
however, but only a fifth case, 25, and a sixth case, 31, the median
is taken as the value that lies just halfway between the fifth and
sixth cases, i.e., Mi = 25 + (31 − 25)/2 = 28. The third quartile
would be the (30/4 + ½) or eighth case; i.e., $Q_3$ = 43.

As another illustration, suppose 51 is added to the set of
numbers, making a total of eleven instead of ten numbers. Then
the first quartile would be the (11/4 + ½) or the 3¼th case. But
there is no 3¼th case, only a third case, 11, and a fourth case, 16.
Hence, $Q_1$ is taken as the value that lies one-fourth of the distance
between the third and fourth cases. The difference between the
third and fourth cases is 16 − 11 = 5, and $Q_1$ is therefore taken
as 11 + 5/4 = 12¼. Similarly, Mi is the (11/2 + ½) or sixth case,
which is 31.


1 When the data are grouped, it is simplest to find the median by interpola-
tion within the class interval for the N/2th case. This method is described
and illustrated in the next chapter. See pp. 218-220.
2 For the method of interpolation when the data are grouped, see pp.
218-220.
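
The interpolation rule can be expressed as a small Python sketch.
The helper names `case_value` and `quartiles` are hypothetical, the
sketch assumes at least two cases, and exact fractions are used so
that positions such as the 3¼th case are represented without
rounding.

```python
from fractions import Fraction

def case_value(ordered, position):
    """Value of the position-th case (counting from 1) in an ordered array.

    A fractional position is interpolated linearly between the two
    neighboring cases.
    """
    whole = int(position)      # the case at or just below the position
    fraction = position - whole
    lower = ordered[whole - 1]
    if fraction == 0:
        return lower
    upper = ordered[whole]
    return lower + fraction * (upper - lower)

def quartiles(values):
    """Q1, median and Q3 taken as the (kN/4 + 1/2)th cases, k = 1, 2, 3."""
    ordered = sorted(values)
    n = len(ordered)
    half = Fraction(1, 2)
    return tuple(case_value(ordered, Fraction(k * n, 4) + half) for k in (1, 2, 3))

ten_cases = [5, 9, 11, 16, 25, 31, 38, 43, 45, 49]
print(quartiles(ten_cases))

eleven_cases = ten_cases + [51]
q1, mi, q3 = quartiles(eleven_cases)
print(q1, mi)
```

With the ten cases the sketch gives 11, 28, and 43; with 51 added it
gives a first quartile of 49/4 (that is, 12¼) and a median of 31, in
agreement with the worked figures.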

The Mode. As in the case of the median, the mode is not a 
mathematical concept. Moreover, it is not a “position average.” 
The mode is an average that is described in terms of relative 
frequency of occurrence. It is defined as the magnitude that 
occurs more frequently than any other. The mode is the most 
probable magnitude and might be considered a “probability 
average,” because it is often thought of in terms of probabilities. 
It may be illustrated as follows: 


 X      F
 2      1
 3      2
 4      4
 6      7
 7      8
 8      5
 9      4
10      3
       34


By definition the mode (Mo) is 7, because this value occurs 
more frequently than any of the others. The probability of the 
mode is 8/34 and is greater than the probability of any other value
of X. 

It is to be noted that the $\bar{X}$ of this example is 6.706 and the
median is 7. The mode is not affected by the size of the cases 
above or below it, nor is it affected by the number of cases 
above or below it, within certain rational limits. For example, 
in the illustration, two magnitudes (X = 8) could be added 
to the distribution and the mode would remain 7, as before; 
but if five magnitudes (X = 8) were added to the distribution, 
then 8 would become the mode, as its frequency would then 
be 10. 
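
The behavior of the mode when cases are added can be traced in a
short Python sketch. The helper `mode_of` is a hypothetical name, and
the sketch assumes a single most frequent magnitude (ties are not
handled).

```python
def mode_of(xs, fs):
    """Magnitude with the greatest frequency (assumes one clear peak)."""
    return xs[fs.index(max(fs))]

# Frequency distribution from the illustration of the mode.
xs = [2, 3, 4, 6, 7, 8, 9, 10]
fs = [1, 2, 4, 7, 8, 5, 4, 3]

original_mode = mode_of(xs, fs)

# Add two cases with X = 8: its frequency rises from 5 to 7.
fs_plus_two = list(fs)
fs_plus_two[xs.index(8)] += 2
mode_after_two = mode_of(xs, fs_plus_two)

# Add five cases with X = 8: its frequency rises from 5 to 10.
fs_plus_five = list(fs)
fs_plus_five[xs.index(8)] += 5
mode_after_five = mode_of(xs, fs_plus_five)

print(original_mode, mode_after_two, mode_after_five)
```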

It has been established that, when the distribution is only 
moderately skewed, the mode can be estimated from the mean 
and median by the following approximate formula: 


$$\text{Mo} = \bar{X} - 3(\bar{X} - \text{Mi}) \qquad (3)$$


Accordingly, the mode may be estimated if the mean and 
the median have been calculated. In actual problems involving 
grouped data the mean and the median are both more accurately 
determinable than is the mode, and for this reason the above 
formula often gives more satisfactory results than any convenient 
direct procedure. This is called the “mathematical mode.” 
It should be emphasized that the formula should not be used, 
however, if the distribution is very skewed. 
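
Formula (3) is simple enough to state as a one-line Python function.
The sketch applies it, purely for illustration, to the mean (6.706)
and median (7) of the small distribution used above to illustrate
the mode; `estimated_mode` is a hypothetical name.

```python
def estimated_mode(mean, median):
    """Approximate mode from Eq. (3): Mo = mean - 3 * (mean - median)."""
    return mean - 3 * (mean - median)

# Mean and median of the small distribution used to illustrate the mode.
estimate = estimated_mode(6.706, 7.0)

# For a symmetrical distribution, mean and median coincide and so does the estimate.
symmetric = estimated_mode(50.0, 50.0)
print(round(estimate, 3), symmetric)
```

The estimate, about 7.59, is near but not equal to the observed mode
of 7, a reminder that the formula is only approximate and is meant
for moderately skewed distributions with many cases.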

The Geometric Mean. The geometric mean (G.M.) is a
mathematical concept and is defined as the nth root of the product
of n values of the variable X. Accordingly, the geometric mean of
5, 8, and 25 is $(5 \times 8 \times 25)^{1/3} = 10$. The geometric mean may
also be defined as the antilogarithm of the arithmetic mean of
the logarithms of the variable X, i.e.,


$$\log \mathrm{G.M.} = \frac{\sum \log X}{N} \tag{4}$$


This may be illustrated as follows: 


log 5  = 0.69897
log 8  = 0.90309
log 25 = 1.39794
         -------
$\sum \log X$ = 3.00000

$\sum \log X / 3$ = 1.00000


The antilogarithm of 1.00000 is 10; hence, G.M. = 10. 
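Both definitions can be compared in a few lines. This is a sketch only; the function name `geometric_mean` is an assumption, and common (base 10) logarithms are used to mirror the worked figures.

```python
import math

def geometric_mean(values):
    """Antilogarithm of the arithmetic mean of the common logarithms."""
    mean_log = sum(math.log10(v) for v in values) / len(values)
    return 10 ** mean_log

gm = geometric_mean([5, 8, 25])          # by logarithms
root = (5 * 8 * 25) ** (1 / 3)           # nth root of the product
print(round(gm, 5), round(root, 5))
```

The logarithmic form is the one suited to long series, since the product of many values quickly becomes unwieldy.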


Just as the arithmetic mean balances the aggregate deviations,
so the geometric mean balances the ratios of the variations;
that is,

$$\frac{X_1}{\mathrm{G.M.}} \times \frac{X_2}{\mathrm{G.M.}} \times \cdots \times \frac{X_N}{\mathrm{G.M.}} = 1 \tag{5}$$


For, by taking logarithms, this expression becomes 


$$\log X_1 + \log X_2 + \log X_3 + \cdots + \log X_N - N \log \mathrm{G.M.} = 0$$

or

$$\sum \log X - N \log \mathrm{G.M.} = 0$$


But from Eq. (4),

$$N \log \mathrm{G.M.} = \sum \log X$$


and hence the expression is shown to be true. 

In some types of problems the geometric mean gives more 
satisfactory results than the arithmetic mean. For example, it 
is necessary to use the geometric mean to average percentage 
increases of population over successive years or decades or to 
average percentage changes in income, production, and the like.¹
Thus, in the column marked X of the following table the esti- 
mated national income of each year is expressed as a percentage 
of the preceding year: 


         Estimated national       Income as a percentage
Year     income produced, in      of the preceding year,
         billions of dollars¹               X

1933            47.3
1934            54.6                     115.43
1935            59.2                     108.42
1936            68.9                     116.39
1937            73.1                     106.10
1938            67.0                      91.66
1939            69.7                     104.03


1 Survey of Current Business, Vol. 20 (March, 1940), p. 19; (April, 1940), p. 11. 


1 Chaddock, R. E., Principles and Methods of Statistics (1925), pp. 126-
127; Croxton, F. E., and D. J. Cowden, Applied Business Statistics (1939),
pp. 225-226.


If the average annual percentage increase is obtained by
calculating the arithmetic average, the answer obtained is
642.03/6 = 107.01, which represents an average annual percentage
increase of 7.01 per cent. Now, if 7.01 is used as a constant
annual percentage increase from 1933, the following figures would
be obtained:


      Constant 7.01 Percentage
Year  Increase from 47.3 in 1933

1934  107.01 per cent of 47.3  = 50.62
1935  107.01 per cent of 50.62 = 54.17
1936  107.01 per cent of 54.17 = 57.97
1937  107.01 per cent of 57.97 = 62.03
1938  107.01 per cent of 62.03 = 66.38
1939  107.01 per cent of 66.38 = 71.03


But in 1939 the actual figure, as shown in the preceding table, 
was 69.7; and the average percentage yearly increase could 
not have been so large as 7.01. To obtain the correct
percentage increase, the geometric and not the arithmetic mean
should be calculated in this instance. Following the formula 
given above for the geometric mean, it is calculated for this 
problem as follows: 


         Estimated national       Logarithm of the
Year     income produced, in      percentage of the
         billions of dollars      preceding year, log X

1933            47.3
1934            54.6                    2.06232
1935            59.2                    2.03511
1936            68.9                    2.06591
1937            73.1                    2.02572
1938            67.0                    1.96218
1939            69.7                    2.01716


If the average annual percentage increase is now obtained
by calculating the geometric mean of the rates of increase,
by first taking the arithmetic mean of the logarithms

$$\frac{12.16840}{6} = 2.02807$$


and then taking the antilogarithm (antilogarithm of 2.02807) 
the answer obtained is 106.68 or an average annual percentage 
increase of 6.68 per cent. If 6.68 is assumed to be the average 
annual percentage increase since 1933, the following figures would 
be obtained: 


      Constant 6.68 Percentage
Year  Increase from 47.3 in 1933

1934  106.68 per cent of 47.3  = 50.46
1935  106.68 per cent of 50.46 = 53.83
1936  106.68 per cent of 53.83 = 57.43
1937  106.68 per cent of 57.43 = 61.27
1938  106.68 per cent of 61.27 = 65.36
1939  106.68 per cent of 65.36 = 69.73


In 1939 the actual figure, as shown above, was 69.7, to which 
69.73 is a close approximation; and hence the average annual 
percentage increase apparently was in fact close to 6.68. 
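The comparison of the two averages may be reproduced as follows. This is a sketch under stated assumptions: the helper `project` is hypothetical, and the ratios are computed from the unrounded income figures, so the last decimal may differ slightly from the hand calculation, which works from rounded percentages and five-place logarithms.

```python
import math

# National income produced, 1933-1939, in billions of dollars (from the table).
income = [47.3, 54.6, 59.2, 68.9, 73.1, 67.0, 69.7]

# Each year's income as a ratio to the preceding year (column X divided by 100).
ratios = [later / earlier for earlier, later in zip(income, income[1:])]

arithmetic = sum(ratios) / len(ratios)
geometric = math.exp(sum(math.log(r) for r in ratios) / len(ratios))

def project(start, rate, years):
    """Apply a constant annual ratio to a starting figure for a number of years."""
    value = start
    for _ in range(years):
        value *= rate
    return value

end_arithmetic = project(income[0], arithmetic, 6)
end_geometric = project(income[0], geometric, 6)
print(round(100 * arithmetic, 2), round(100 * geometric, 2))
print(round(end_arithmetic, 2), round(end_geometric, 2))
```

Only the geometric mean of the ratios carries the 1933 figure forward to the actual 1939 figure; the arithmetic mean overshoots it.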

The Harmonic Mean. The harmonic mean (H.M.) is the 
reciprocal of the average of the reciprocals of observations of the 
variable X, thus: 


$$\mathrm{H.M.} = \frac{N}{\sum \dfrac{1}{X}} \tag{6}$$

Accordingly, the harmonic mean of 5, 8, and 25 would be found
as follows:

From a table of reciprocals or by calculation, the reciprocals
of 5, 8, and 25 are determined (0.200, 0.125, and 0.040), and
hence the harmonic mean, by definition, is

$$\frac{3}{0.200 + 0.125 + 0.040} = \frac{3}{0.365} = 8.22$$


The geometric mean of these three numbers, as discovered above,
is 10; the arithmetic mean is 5 + 8 + 25 = 38 divided by 3,
or 12.67. It is thus seen that the arithmetic mean is the
largest, the geometric mean next, and the harmonic mean the
smallest of these three averages. It is always true, for positive
values that are not all equal, that¹


$$\mathrm{H.M.} < \mathrm{G.M.} < \bar{X} \tag{7}$$
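The three averages of 5, 8, and 25 can be computed side by side. The function names below are assumptions for this sketch; natural logarithms are used for the geometric mean, which gives the same result as common logarithms.

```python
import math

def harmonic_mean(values):
    """Reciprocal of the average of the reciprocals."""
    return len(values) / sum(1 / v for v in values)

def geometric_mean(values):
    """Antilogarithm of the mean logarithm."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

def arithmetic_mean(values):
    return sum(values) / len(values)

data = [5, 8, 25]
hm, gm, am = harmonic_mean(data), geometric_mean(data), arithmetic_mean(data)
print(round(hm, 2), round(gm, 2), round(am, 2))
```

When every value is the same, the three averages coincide; otherwise the ordering shown holds for positive values.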


1 For a proof see G. R. Davies and W. F. Crowder, Methods of Statistical
Analysis in the Social Sciences, p. 313.


The usefulness of the harmonic mean arises in connection with
certain types of problems in which variable quantities of one
variable are compared with a constant quantity of another.
For illustration, speeds may be looked upon as variable numbers
of miles per minute (a constant quantity of time) or as variable
amounts of time required to cover a given distance. Similarly,
prices may be looked upon as variable amounts of money per
unit of goods sold (a constant quantity of goods) or as variable
amounts of goods that can be purchased with $1. In many
such problems the choice of the variable for which the quantity
is always constant is optional, depending upon the type of
information it is desired to emphasize. There is a nice
distinction between the mean and the harmonic mean wherever
such interchangeability is present. This will be illustrated
by examples.

The accompanying table shows data on prices of corn.


WHOLESALE PRICE OF NO. 3 YELLOW CORN


Year     Dollars per Bushel
1913           0.61
1919           1.59
1929           0.93
1939           0.50


Source: Survey of Current Business, April, 1940, p. 18. 
In the table the amount of money varies, but the quantity
of corn is constant. The average price per bushel may be
calculated directly from this table, thus: 3.63/4 = 0.9075. In order
to obtain the harmonic mean, the reciprocals of these prices must
first be calculated.


WHOLESALE PRICE OF NO. 3 YELLOW CORN


Year     Bushels per Dollar
1913           1.64
1919           0.63
1929           1.08
1939           2.00


The average of these reciprocals must be computed, thus:


$$\frac{5.35}{4} = 1.3375$$


The reciprocal of the latter number must be obtained, namely,
0.74766. This last number is the harmonic mean of the prices
expressed in dollars per bushel. The harmonic mean, therefore,
of the prices per bushel of No. 3 yellow corn is approximately
75 cents; and it is the price per bushel of the average amount of
No. 3 yellow corn that could have been purchased for $1, or,
in other words, it is the reciprocal of the average amount of No. 3
yellow corn that could have been purchased for $1. If the
reciprocal of the mean price, 0.9075, is taken, a figure of quite
different significance is obtained, namely, 1.102. This reciprocal
is the number of bushels that could have been bought for $1 at
the mean price.
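The corn-price figures may be traced in a short sketch. The variable names are assumptions; the reciprocals are carried unrounded here, so the harmonic mean differs from the hand figure in the third decimal.

```python
prices = [0.61, 1.59, 0.93, 0.50]                 # dollars per bushel, 1913-1939

mean_price = sum(prices) / len(prices)            # arithmetic mean, dollars per bushel
bushels_per_dollar = [1 / p for p in prices]      # the reciprocal table
mean_bushels = sum(bushels_per_dollar) / len(prices)
harmonic_price = 1 / mean_bushels                 # harmonic mean of the prices
bushels_at_mean_price = 1 / mean_price            # a figure of different significance

print(round(mean_price, 4), round(harmonic_price, 4), round(bushels_at_mean_price, 3))
```

The two reciprocals answer different questions: one starts from the average amount of corn a dollar bought, the other from the average price of a bushel.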

In deciding whether to use the arithmetic or the harmonic 
mean in any given problem, it should be determined which 
magnitude should be regarded as the constant (for example, the 
amount of corn bought or the amount of money spent), a matter 
that can usually be decided without difficulty in a practical 
problem. If the data are recorded with the appropriate quantity 
constant, the arithmetic mean may be used. If the appropriate 
item is made a variable as the data are tabulated, the harmonic 
mean must be used. 

Another illustration will serve to clarify further the use of 
the harmonic mean. The efficiency of a fighting airplane may 
be determined, in part at least, by its speed, which can be 
expressed either as the number of miles flown per minute or the 
amount of time required to fly a mile. Following are the results 
of tests of a plane under trial: 


RESULTS OF TESTS
Miles per minute . . . . . 6, 4, 7, 6, 5


Is the significant measure the rate at which a plane flies or
the amount of time required to fly a number of miles? If it is
admitted that the rate at which the plane flies is the important
consideration (that is, the number of minutes required to fly a
mile), the reciprocal of the harmonic mean is the relevant
measure, inasmuch as the recorded data make the time element
constant and not the distance flown. The arithmetic mean, if
calculated, would not be lacking in significance, but its reciprocal
should not be compared with rate measures in which the number
of miles is constant and the time varies. The average number of
miles per minute is 28/5 = 5.6. The reciprocal of this number,
0.17857, is not the harmonic mean, and it is not the average time
that it takes to travel a mile. On the contrary, 0.17857 minute
is the amount of time it requires to go a mile when traveling at the
average number of miles per minute. The average amount of
time required to travel a mile is a different thing, namely, the
average of 1/6, 1/4, 1/7, 1/6, and 1/5 minute, or 0.185 minute. While the two
results are close together in value, it would make a large difference
in calculations having to do with hours of time if the arithmetic
mean were used when the harmonic mean ought to have been
used.
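The distinction drawn in the airplane example can be sketched as follows, using the five test results. The variable names are assumptions made for the illustration.

```python
speeds = [6, 4, 7, 6, 5]                                  # miles per minute

mean_speed = sum(speeds) / len(speeds)                    # arithmetic mean of the rates
time_at_mean_speed = 1 / mean_speed                       # minutes per mile at the mean rate
mean_time = sum(1 / s for s in speeds) / len(speeds)      # average minutes per mile
harmonic_speed = 1 / mean_time                            # harmonic mean of the rates

print(round(mean_speed, 2), round(time_at_mean_speed, 5))
print(round(mean_time, 5), round(harmonic_speed, 3))
```

The average time per mile is the reciprocal of the harmonic mean of the speeds, not the reciprocal of their arithmetic mean.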

The Concept of an Average as a Summary Figure. The general 
significance of an average as a summary figure may be illustrated. 
Suppose that information concerning the heights of mature males 
in Newark, N.J., is desired. Heights of all the mature males 
in Newark are therefore measured to the nearest $ inch. The 
data collected will constitute complete information about the 
heights of mature males in Newark. But this knowledge, in 
untabulated form without summary figures, is not easy to com- 
prehend. It is necessary to analyze this total knowledge in 
some manner so that it may become more significant. The 
manner in which the analysis will proceed depends upon the 
object in mind; for example, an answer to any one of the following 
questions may give a more significant view: 

1. What height will coincide with the greatest number of 
recorded observations? The answer to this question is, of 
course, the mode. 

2. What is the height such that greater and smaller heights 
have been recorded with equal frequency? The answer to this 
question is the median. 

3. What is the height such that the sum of the squares of the
differences between it and the recorded observations is a
minimum? Or what is the height such that the algebraic sum of
the differences between it and the recorded observations is zero?
The answer to this question will be the arithmetic mean.

4. What is the height H such that the product of the ratios of 
the recorded observations to H is unity? The answer to this 
question will be the geometric mean. 

5. If several rates of speed were given in miles per second, how 
many seconds on the average will be required to travel 1 mile? 
The answer is the reciprocal of the harmonic mean. (Heights 
could not be used as an illustration because the harmonic mean 
is significant only in the dual-variable type of quantity expres- 
sion, as explained above.) 
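The defining properties named in questions 3 and 4 can be verified numerically. The heights below are hypothetical figures chosen only for the sketch, and the helper `sum_of_squares` is likewise an assumption.

```python
import math

heights = [66.0, 67.5, 68.0, 68.0, 69.5, 71.0, 72.5]     # hypothetical heights, inches

mean = sum(heights) / len(heights)
gm = math.exp(sum(math.log(h) for h in heights) / len(heights))

sum_of_deviations = sum(h - mean for h in heights)        # question 3: zero
ratio_product = math.prod(h / gm for h in heights)        # question 4: unity

def sum_of_squares(center):
    """Sum of squared differences between the observations and a chosen point."""
    return sum((h - center) ** 2 for h in heights)

print(round(sum_of_deviations, 9), round(ratio_product, 9))
print(sum_of_squares(mean) < sum_of_squares(mean + 0.1))
```

Each average is thus the answer to a different question put to the same set of observations.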

The term “average” is a generic term, and any one of these
summary figures may be called an average; the decision as to
which average should be used depends upon what question is to
be answered. If the median height is known and the height of a
man is greater than the median, it can be inferred that the man is
taller than at least half of the men. If a man’s height is equal to
the mode, it is known that he has the most common, or usual, height. If
height is analyzed in the abstract, as it might be in research on 
the effects of heredity and environment, the arithmetic mean is 
likely to be used, for such analyses ordinarily involve the solution 
of problems in mathematical terms. 


MOMENTS 


Definition. In physics, “moment” is a measure of a force with
respect to its tendency to produce rotation. The strength of
the tendency depends on the amount of force and the distance
from the origin of the point at which the force is exerted. If a
number of forces, $F_1, F_2, \ldots, F_n$, at distances $X_1, X_2, \ldots, X_n$,
are applied, the moment of the first force about the origin is
$F_1 X_1$, the moment of the second force is $F_2 X_2$, etc. These
moments are additive so that $\sum FX$ is the total moment about
the origin. If the total moment is divided by the total force,
the quotient is termed “a moment coefficient.” The formula is
$\sum FX/N$, where $N = \sum F$ is the total force.

It will be recognized that the formula for a moment coefficient
is identical with that for an arithmetic mean. This identity
has led statisticians to speak of the arithmetic mean as the
“first moment about the origin.” Technically the mean is a
moment coefficient and not a total moment, but in the case of
frequency curves, with which mathematical statistics is primarily
concerned, the total frequency N is generally taken as unity,¹
so that the total moment and the moment coefficient are identical.
In any case, it has become customary in statistics to speak of
the mean $\bar{X} = \sum FX/N$ as the first moment about the origin, and
the distinction between total moment and moment coefficient is
ignored.

The concept of moments is also extended to higher powers.
Thus in statistics $\sum FX^2/N$ is termed the “second moment
about the origin,” and $\sum FX^3/N$ is called the “third moment
about the origin,” etc. In general, the moments about zero
are as follows:


1 See pp. 276-277 and Appendix, Table VI. 
$$
\begin{aligned}
\nu_1 &= \frac{\sum FX}{N} \\
\nu_2 &= \frac{\sum FX^2}{N} \\
\nu_3 &= \frac{\sum FX^3}{N}
\end{aligned} \tag{8}
$$
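The moments about the origin may be computed directly from a frequency table. The function name `moment_about_origin` and the dictionary layout are assumptions for this sketch; the data are those of the frequency table used to illustrate the mode.

```python
def moment_about_origin(table, k):
    """k-th moment about zero, sum(F * X**k) / N, for a table {X: F}."""
    n = sum(table.values())
    return sum(f * x ** k for x, f in table.items()) / n

freq = {2: 1, 3: 2, 4: 4, 6: 7, 7: 8, 8: 5, 9: 4, 10: 3}

first = moment_about_origin(freq, 1)     # identical with the arithmetic mean
second = moment_about_origin(freq, 2)
third = moment_about_origin(freq, 3)
print(round(first, 3), round(second, 3), round(third, 3))
```

Dividing by N makes each of these a moment coefficient in the strict sense, in keeping with the convention described above.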


When the moments are calculated, not from zero, but from 
the mean, or “centroid,” they are defined as follows: 


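$$
\begin{aligned}
\mu_1 &= \frac{\sum Fx}{N} \\
\mu_2 &= \frac{\sum Fx^2}{N} \\
\mu_3 &= \frac{\sum Fx^3}{N} \\
\mu_4 &= \frac{\sum Fx^4}{N} \\
\mu_n &= \frac{\sum Fx^n}{N}
\end{aligned} \tag{9}
$$


where $x = X - \bar{X}$. It will be noted that it is a convention in
statistics to use small x to represent deviations from the
arithmetic mean.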

The first moment about the mean is the sum of the deviations 
from the mean multiplied by their respective frequencies and 
divided by the number of cases in the frequency distribution. 
The second moment is similarly obtained except that the devi- 
ations are squared before adding. To obtain the third moment 
the deviations are cubed, multiplied by their respective fre- 
quencies, added, and divided by the number of cases; to obtain 
the fourth moment the deviations are raised to the fourth power; 
to obtain the nth moment the deviations are raised to the nth 
power. This is indicated in the above equations. As already
demonstrated above,¹ the first moment about the mean is equal
to zero. In mechanics, the square root of the second moment
is the radius of gyration of a set of s equal particles, with respect
to a given centroidal axis.²

1 See pp. 169-170.
2 Cf. Rietz, H. L., Mathematical Statistics (1936), p. 20.


Purpose of Moments. For statistics the moments serve
primarily as intermediary values. The moments about the
mean are intermediary values useful for calculating measures
of variability, skewness, and other characteristics of the
frequency distribution. Because of their great convenience in
obtaining measures of the various characteristics of a frequency
distribution, the calculation of the first four moments about the
mean may well be made the first step in the analysis of a
frequency distribution. This valuable feature will be illustrated
in the ensuing pages and in the next chapter. Following are
important generalizations concerning moments measured from
the arithmetic mean for all frequency distributions,


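$$
\begin{aligned}
\mu_1 &= 0 \\
\mu_2 &= \sigma^2\,{}^{*}
\end{aligned} \tag{10}
$$

and in symmetrical distributions,

$$\mu_3 = 0, \qquad \mu_5 = 0, \qquad \mu_7 = 0, \ \ldots \tag{11}$$

for all “odd” numbered moments.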
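These generalizations can be checked on a small table. The function `central_moment` and the symmetrical frequency table below are assumptions introduced for the sketch.

```python
def central_moment(table, k):
    """k-th moment about the mean, sum(F * x**k) / N, where x = X - mean."""
    n = sum(table.values())
    mean = sum(f * x for x, f in table.items()) / n
    return sum(f * (x - mean) ** k for x, f in table.items()) / n

# An illustrative symmetrical table {X: F}.
symmetrical = {1: 2, 2: 3, 3: 4, 4: 5, 5: 4, 6: 3, 7: 2}

mu1 = central_moment(symmetrical, 1)     # zero for every distribution
mu2 = central_moment(symmetrical, 2)     # the square of the standard deviation
mu3 = central_moment(symmetrical, 3)     # zero when the distribution is symmetrical
mu5 = central_moment(symmetrical, 5)
print(mu1, round(mu2, 4), mu3, mu5)
```

The even moments remain positive because raising a deviation to an even power removes its sign, whereas in a symmetrical table the odd powers cancel in pairs.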


VARIABILITY 


It was indicated above that the chief interest of statistics is 
in variability; summary figures such as averages are useful as 
points of departure for further study of the frequency distribu- 
tion. It may be noted that the principle of averaging is funda- 
mental throughout; for all the various methods of summarizing, 
whether it be central tendencies or variations from points of 
central tendencies, use the principle of averaging as a method of 
summarization or measurement. 

The Range. The most obvious method of measuring
variability is to take the difference between the highest value and the
lowest value; this difference is called the “range.” Thus in a
set of several hundred grades, if the highest grade is 92, and the
lowest 13, the range is 92 - 13 = 79. The range is easily
understood and easy to compute, but it is dependent entirely on
the two extreme items. It is therefore seldom used as the
measure of variability when accuracy and stability of results are
desired. It is better to use some measure of variability that is
dependent on more than just two, if not all, of the cases.¹

* Read “sigma square.”

The Average Deviation. The average deviation (A.D.) is the
arithmetic average of the variations of the data from their central
tendency. This measure of variability may be illustrated as
follows:


Variable     Deviations from the Mean ($\bar{X}$ = 6)
   X                      x

   2                     -4
   3                     -3
   4                     -2
   6                      0
   8                      2
   9                      3
  10                      4
  --
  42               $\sum |x|$ = 18


The deviations have to be added without regard to sign;
otherwise, their sum is zero. In this example there are seven
deviations (one of which is zero), and hence the average deviation
is 18/7, or 2.57.

$$\mathrm{A.D.}_{\bar{X}} = \frac{\sum |x|}{N} \tag{12}$$


where $x = X - \bar{X}$.

The average deviation could be measured from the median or 
the mode, as well as from the arithmetic mean. In fact, it is 
usually measured from the median since it is least when so 
computed. Let $X - \mathrm{Mi} = x'$. Then the formula for the
average deviation from the median is

$$\mathrm{A.D.}_{\mathrm{Mi}} = \frac{\sum |x'|}{N} \tag{12a}$$
Mi is used as a subscript to A.D. to indicate that the average 
deviation is measured from the median. It is to be noted that 
the average deviation is a measure of variability that is based 
on all the observed cases. 
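The average deviation may be sketched as below. The function name `average_deviation` is an assumption, and the second data set is hypothetical, added only to show a case in which the mean and the median differ.

```python
from statistics import median

def average_deviation(data, center):
    """Arithmetic average of the deviations from a chosen point, signs ignored."""
    return sum(abs(v - center) for v in data) / len(data)

values = [2, 3, 4, 6, 8, 9, 10]                  # the illustration above
mean = sum(values) / len(values)
ad_mean = average_deviation(values, mean)

skewed = [1, 2, 3, 4, 20]                        # hypothetical skewed data
skewed_mean = sum(skewed) / len(skewed)
ad_from_mean = average_deviation(skewed, skewed_mean)
ad_from_median = average_deviation(skewed, median(skewed))

print(round(ad_mean, 2), round(ad_from_mean, 2), round(ad_from_median, 2))
```

In the skewed set the deviation measured from the median is the smaller of the two, in keeping with the remark that the average deviation is least when so computed.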

1 See, however, Smith and Duncan, Sampling Statistics, pp. 294-296, for
the use of the range in sampling analysis.

2 Cf. Kelley, Truman, Statistical Method (1923), p. 74.

In the foregoing example each X occurred only once, and so
the F's were all unity. When there is more than one of any X,
the formulas are

$$\mathrm{A.D.}_{\bar{X}} = \frac{\sum F|x|}{N} \qquad \mathrm{A.D.}_{\mathrm{Mi}} = \frac{\sum F|x'|}{N}$$

where $x = X - \bar{X}$ and $x' = X - \mathrm{Mi}$.

Standard Deviation. The most generally used measure of 
variability, however, is the standard deviation. This will be 
readily understood when it is seen that the standard deviation 
is easily treated mathematically; the average deviation has 
very distinct limitations in this respect owing to the disregard 
of plus and minus signs. In the case of the standard deviation 
this defect is overcome by squaring the deviations before they 
are averaged and then taking the square root of the average. 
The symbol for the standard deviation is the small Greek letter $\sigma$,
read “sigma.” By definition, the standard deviation is the 
square root of the average of the squared deviations from the mean. 


Symbolically,

$$\sigma = \sqrt{\frac{\sum Fx^2}{N}} \tag{13}$$


As may be seen by comparing this definition with the definition
of the moments about the mean [Eqs. (9)], the standard
deviation is the square root of the second moment; i.e.,

$$\sigma = \mu_2^{1/2} \tag{13a}$$


Following is an illustration of the computation of the standard 
deviation: 


Variable     Deviations from the     Deviations Squared
             Mean ($\bar{X}$ = 5)
   X                 x                     x^2

   1                -4                      16
   2                -3                       9
   3                -2                       4
   4                -1                       1
   5                 0                       0
   6                 1                       1
   7                 2                       4
   8                 3                       9
   9                 4                      16

                                  $\sum x^2$ = 60
In the above illustration the standard deviation is $\sqrt{60/9}$, or
2.58. It will be noted here that the F's are all unity and
therefore do not enter into the calculations.
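The computation can be sketched as follows. The comparison with `statistics.pstdev` is added on the assumption that the reader wishes to confirm that the definition divides by N, the number of cases, and not by N - 1.

```python
import math
import statistics

values = list(range(1, 10))                          # X = 1, 2, ..., 9
n = len(values)
mean = sum(values) / n
sum_sq = sum((x - mean) ** 2 for x in values)        # sum of the squared deviations
variance = sum_sq / n                                # second moment about the mean
sigma = math.sqrt(variance)                          # standard deviation

print(sum_sq, round(variance, 3), round(sigma, 2))
print(abs(sigma - statistics.pstdev(values)) < 1e-12)
```

Squaring the deviations removes the signs without discarding them arbitrarily, which is what makes the measure convenient in further algebra.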

Variance. The square of the standard deviation is called the
“variance.” Since $\sigma^2 = \sum Fx^2/N$, the variance is merely another
name for the second moment about the mean [see the definition
of this in Eqs. (9)]. The second moment is smaller when
calculated from the mean than when calculated from any other point.¹


SKEWNESS 


Definition and Significance. “Skewness” means asymmetry.
Frequency distributions that have extreme variations resulting
in a longer tail in one direction from the peak than in the other
direction from the peak are asymmetrical in appearance. Such
distributions are called “skewed distributions.” Up to this
point the discussion has concerned methods for measuring
central tendencies (averages) and methods for measuring
variability (average deviation and standard deviation). The
significance of an average may be considerably modified when 
considered in comparison with the average deviation or standard 
deviation; but, in addition, the significance of the averages 
depends upon the symmetry or lack of symmetry in the dis- 
tribution of individual cases. Measures of skewness are accord- 
ingly desirable. 

1 Cf., for proof, p. 215.


Skewness Measured by Relationship between the Mean, the
Median, and the Mode. Skewness is most easily measured by
the relationship between the mean, the median, and the mode;
for it will be recalled that the mode is not affected by the
magnitudes or number of variations above or below it. The median
is affected by the number of variations, but not by their
magnitudes. The mean is affected by both the number and the
magnitudes of the variations from it. Consequently, it would be
expected that

1. The mode, by definition, will remain at the point of greatest
frequency, whether or not the distribution is skewed.

2. The median will be pulled away from the mode in
distributions that are skewed, since the larger number of items on
one side pull it from the point of greatest frequency.

3. The mean in such distributions will be pulled away from the
mode still farther since the larger number and extreme magnitude
of the items on one side pull it farther away from the point of
greatest frequency.

These points are illustrated in the following frequency dis- 
tributions: 


1. SYMMETRICAL      2. SKEWED POSITIVELY      3. SKEWED NEGATIVELY
    X     F              X     F                   X     F
    1     3              1     6                   1     1
    2     4              2     8                   2     2
    3     6              3     9                   3     3
    4     9              4     8                   4     5
    5    10              5     7                   5     7
    6     9              6     5                   6     9
    7     6              7     3                   7    10
    8     4              8     2                   8     8
    9     3              9     1                   9     6
         --                   --                        --
         54                   49                        51

       (1)                   (2)                       (3)
 $\bar{X}$ = 5         $\bar{X}$ = 3.92          $\bar{X}$ = 6.10
 Mi = 5                Mi = 3.69                 Mi = 6.33
 Mo = 5                Mo = 3                    Mo = 7
 $\sigma$ = 2.08       $\sigma$ = 2.05           $\sigma$ = 2.01


Figures 64 to 66 show in graphic form the frequency dis- 
tributions given on this page. The relationship between the 
three averages will be more clearly visualized from these figures. 

In Fig. 64, for example, all three averages equal 5. In Fig. 65, 
which is the positively skewed frequency distribution, the three 
differ from each other; the mode is 3, the median is 3.69, and the 
mean is 3.92. The extreme variations toward the higher values 
give the frequency distribution a longer tail to the right, or 
toward the higher values of X, and this pulls the median and the 
mean in that direction from the mode. 

In Fig. 66 a negatively skewed frequency distribution is
illustrated. This histogram is a graph of the third set of figures
shown on this page. In this figure, too, the averages differ
from each other, but here the mode is the largest. The extreme
variations toward the lower values give the frequency
distribution a longer tail to the left, or toward the lower values of X,
and this pulls the median and the mean in that direction from
the mode; so while the mode is 7, the median is 6.33 and the
mean is 6.10.


1 Calculations of averages were made on the assumption that the X values
are mid-points of class intervals of unity.
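The summary figures for the three distributions may be recomputed with a short sketch. The function `summarize` is hypothetical; it assumes, as the footnote to the table states, that each X is the mid-point of a class interval of unity, and it interpolates the median within the class containing the (N/2)th case.

```python
import math

def summarize(table):
    """Return (mean, median, mode, sigma) for a table {mid-point: frequency}."""
    n = sum(table.values())
    mean = sum(x * f for x, f in table.items()) / n
    sigma = math.sqrt(sum(f * (x - mean) ** 2 for x, f in table.items()) / n)
    mode = max(table, key=table.get)
    cumulative = 0
    median = None
    for x in sorted(table):
        if cumulative + table[x] >= n / 2:
            # Interpolate within the class of width 1 centred on x.
            median = (x - 0.5) + (n / 2 - cumulative) / table[x]
            break
        cumulative += table[x]
    return mean, median, mode, sigma

symmetrical = dict(zip(range(1, 10), [3, 4, 6, 9, 10, 9, 6, 4, 3]))
positive = dict(zip(range(1, 10), [6, 8, 9, 8, 7, 5, 3, 2, 1]))
negative = dict(zip(range(1, 10), [1, 2, 3, 5, 7, 9, 10, 8, 6]))

for name, table in [("symmetrical", symmetrical), ("positive", positive), ("negative", negative)]:
    mean, med, mode, sigma = summarize(table)
    print(f"{name:12s} mean={mean:.2f} median={med:.2f} mode={mode} sigma={sigma:.2f}")
```

In each skewed case the median falls between the mode and the mean, which is the relationship the figures are meant to display.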


Fig. 64.—A symmetrical frequency distribution.

Fig. 65.—A positively skewed frequency distribution.


Skewness may accordingly be measured by the difference 
between the mean and the mode. For the above examples, 


$$\bar{X} - \mathrm{Mo} = 0 \qquad \bar{X} - \mathrm{Mo} = +0.92 \qquad \bar{X} - \mathrm{Mo} = -0.90$$


This is a measure of the aggregate amount of skewness, but
how significant is this amount? One device to weigh this is to
compare the aggregate amount of skewness with the standard
deviation and thereby to obtain a coefficient of skewness, as
follows:

For example (2)

$$\mathrm{sk} = \frac{\bar{X} - \mathrm{Mo}}{\sigma} = \frac{0.92}{2.05} = +0.45 \tag{14}$$

For example (3)

$$\mathrm{sk} = \frac{\bar{X} - \mathrm{Mo}}{\sigma} = \frac{-0.90}{2.01} = -0.45$$

The relative amount of skewness, or asymmetry, in these two 
distributions comes out equal, although one is positive and the 
other is negative. 
Fig. 66.—A negatively skewed frequency distribution.


Another measure of skewness is based on the median and 
the mean. It has been established that in a moderately asym- 
metrical distribution, if the mean is pulled a distance P away 
from the mode, the median is pulled approximately two-thirds 
P away from the mode in the same direction; that is, 


$$\bar{X} - \mathrm{Mo} = 3(\bar{X} - \mathrm{Mi})$$

Hence, skewness can also be measured by three times the distance
between the mean and the median, as follows:

$$\mathrm{sk} = \frac{3(\bar{X} - \mathrm{Mi})}{\sigma} \tag{14a}$$


The second equation has the advantage over (14) in that the
median is often easier to locate than the mode. The mode is
frequently difficult to locate in a sample distribution and is
subject to wide fluctuations of sampling; in addition, its location
is often dependent merely upon the selection of the class interval.
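The two coefficients, Eqs. (14) and (14a), may be compared on the summary figures of the skewed examples. The function names are assumptions for this sketch, and the inputs are the rounded averages given in the table of three distributions.

```python
def skewness_from_mode(mean, mode, sigma):
    """Coefficient of skewness, Eq. (14): (mean - mode) / sigma."""
    return (mean - mode) / sigma

def skewness_from_median(mean, median, sigma):
    """Coefficient of skewness, Eq. (14a): 3 * (mean - median) / sigma."""
    return 3 * (mean - median) / sigma

# Example (2): positively skewed.  Example (3): negatively skewed.
sk2_mode = skewness_from_mode(3.92, 3, 2.05)
sk2_median = skewness_from_median(3.92, 3.69, 2.05)
sk3_mode = skewness_from_mode(6.10, 7, 2.01)
sk3_median = skewness_from_median(6.10, 6.33, 2.01)

print(round(sk2_mode, 2), round(sk2_median, 2))
print(round(sk3_mode, 2), round(sk3_median, 2))
```

The two formulas agree in sign but not exactly in size, since the relation between mean, median, and mode on which (14a) rests is only approximate.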

Skewness Measured by the Medians and Quartiles. A simple
measure of skewness, and one that is easily comprehended, is
obtained by comparing the location in a frequency distribution of
its median and quartiles. This can be illustrated by taking the
same three distributions used above and calculating their
respective first and third quartiles (the medians are already
calculated). From Fig. 67 it is seen that in the symmetrical
distribution the first quartile is smaller than the median by the
same amount that the third quartile is larger than the median;
$Q_1$ and $Q_3$ are equidistant from the median. Accordingly,

$$Q_3 - \mathrm{Mi} - (\mathrm{Mi} - Q_1) = 0$$

If the terms are rearranged, this may be written

$$Q_3 + Q_1 - 2\mathrm{Mi} = 0$$

Fig. 67.—Relation between the first and third quartiles and the median of a
symmetrical distribution.


From Fig. 68, the positively skewed case, it is seen that the
first quartile is smaller than the median by an amount
considerably less than the amount by which the third quartile is
larger than the median; accordingly,

$$Q_3 - \mathrm{Mi} - (\mathrm{Mi} - Q_1) = +0.222$$

From Fig. 69, the negatively skewed distribution, it is seen
that the first quartile is smaller than the median by an amount
considerably larger than the amount by which the third quartile is
larger than the median; for $Q_3 - \mathrm{Mi} - (\mathrm{Mi} - Q_1) = -0.154$.

Fig. 68.—Relation between the first and third quartiles and the median of a
positively skewed distribution.

Fig. 69.—Relation between the first and third quartiles and the median of a
negatively skewed distribution.


If the location of the quartiles compared with the location
of the median, in each distribution, is now compared with half
the distance between the two quartiles (i.e., the average
distance of the two quartiles from the median), a coefficient or
relative measure of skewness is obtained and the formula that
measures skewness is as follows:

$$\mathrm{sk} = \frac{Q_3 - \mathrm{Mi} - (\mathrm{Mi} - Q_1)}{\dfrac{Q_3 - Q_1}{2}} = \frac{Q_3 + Q_1 - 2\mathrm{Mi}}{Q} \tag{15}$$

where Q is used as a symbol signifying the semiquartile range.


Not only is the coefficient of skewness based upon the median 
and quartiles a good one because these are usually easy to find, 
but also this coefficient of skewness has the advantage that it 
has value limits between +2 and -2. It can be no greater
than +2, that is, when positive skewness is so great that the
median equals the first quartile. It can be no less than -2,
that is, when negative skewness is so great that the median
equals the third quartile. In Figs. 68 and 69, respectively,
the coefficients of skewness are +0.146 and —0.106 expressed 
as ratios. Expressed as percentages of skewness, they are 
+14.6 per cent and —10.6 per cent. 
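Equation (15) can be sketched as below. The helper `grouped_quantile` is hypothetical and assumes class intervals of unity with linear interpolation inside the class; other conventions for locating quartiles give slightly different figures, so the results need not agree to the last decimal with those quoted in the text.

```python
def grouped_quantile(table, p):
    """Interpolated quantile for a table {mid-point: frequency}, class width 1."""
    n = sum(table.values())
    target = p * n
    cumulative = 0
    for x in sorted(table):
        if cumulative + table[x] >= target:
            return (x - 0.5) + (target - cumulative) / table[x]
        cumulative += table[x]
    raise ValueError("p must lie between 0 and 1")

def quartile_skewness(table):
    """Eq. (15): (Q3 + Q1 - 2 Mi) divided by the semiquartile range."""
    q1, mi, q3 = (grouped_quantile(table, p) for p in (0.25, 0.50, 0.75))
    return (q3 + q1 - 2 * mi) / ((q3 - q1) / 2)

symmetrical = dict(zip(range(1, 10), [3, 4, 6, 9, 10, 9, 6, 4, 3]))
positive = dict(zip(range(1, 10), [6, 8, 9, 8, 7, 5, 3, 2, 1]))
negative = dict(zip(range(1, 10), [1, 2, 3, 5, 7, 9, 10, 8, 6]))

sk_sym = quartile_skewness(symmetrical)
sk_pos = quartile_skewness(positive)
sk_neg = quartile_skewness(negative)
print(round(sk_sym, 3), round(sk_pos, 3), round(sk_neg, 3))
```

Because the median always lies between the two quartiles, the coefficient cannot pass outside the limits of +2 and -2.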

Third Moment as a Measure of Skewness. The cube root of
the third moment is also a good measure of skewness. This is
due to the fact that (1) if the distribution is symmetrical, it
will be zero; but (2) if the distribution is not symmetrical, the
third moment will not be equal to zero but will have a positive 
or negative value according to whether the distribution is 
skewed positively or negatively. This is illustrated by simple 
examples showing a symmetrical and a positively skewed distri- 
bution. The negatively skewed distribution is left to the 
student to work out. 


1. SYMMETRICAL DISTRIBUTION


 X     F      x      Fx     Fx^2    Fx^3
 1     2     -3      -6      18     -54
 2     3     -2      -6      12     -24
 3     4     -1      -4       4      -4
 4     5      0       0       0       0
 5     4      1       4       4       4
 6     3      2       6      12      24
 7     2      3       6      18      54
      --
      23     $\sum Fx = 0$    $\sum Fx^2 = 68$    $\sum Fx^3 = 0$
2. POSITIVELY SKEWED DISTRIBUTION


 X     F      x      Fx     Fx^2    Fx^3
 1     6     -2     -12      24     -48
 2     8     -1      -8       8      -8
 3    10      0       0       0       0
 4     4      1       4       4       4
 5     3      2       6      12      24
 6     2      3       6      18      54
 7     1      4       4      16      64
      --
      34     $\sum Fx = 0$    $\sum Fx^2 = 82$    $\sum Fx^3 = 90$


In the second example, the third moment is equal, not to
zero, but to 90/34. The measure of skewness by this method
would be $\sqrt[3]{90/34}$, or 1.38. Symbolically, this measure of
skewness is

$$\mathrm{sk} = \sqrt[3]{\frac{\sum Fx^3}{N}} = \mu_3^{1/3} \tag{16}$$

It may be seen from the definition of the third moment [Eqs.
(9)] that this measurement of skewness is the cube root of the
third moment. Expressed as a coefficient of skewness, where
the aggregate amount of skewness is in terms of the standard
deviation, this measure of skewness is as follows:

$$\mathrm{sk} = \frac{\mu_3^{1/3}}{\sigma}$$
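The third-moment measure can be sketched for the two worked tables. The function `third_moment_skewness` is hypothetical; `math.copysign` is used so that the cube root keeps the sign of the third moment when that moment is negative.

```python
import math

def third_moment_skewness(table):
    """Return (mu3, cube root of mu3, cube root of mu3 divided by sigma)."""
    n = sum(table.values())
    mean = sum(f * x for x, f in table.items()) / n
    mu2 = sum(f * (x - mean) ** 2 for x, f in table.items()) / n
    mu3 = sum(f * (x - mean) ** 3 for x, f in table.items()) / n
    cube_root = math.copysign(abs(mu3) ** (1 / 3), mu3)   # real cube root with sign
    return mu3, cube_root, cube_root / math.sqrt(mu2)

symmetrical = {1: 2, 2: 3, 3: 4, 4: 5, 5: 4, 6: 3, 7: 2}
skewed = {1: 6, 2: 8, 3: 10, 4: 4, 5: 3, 6: 2, 7: 1}

mu3_sym, root_sym, coef_sym = third_moment_skewness(symmetrical)
mu3_skw, root_skw, coef_skw = third_moment_skewness(skewed)
print(mu3_sym, round(mu3_skw, 3), round(root_skw, 2), round(coef_skw, 3))
```

Dividing by the standard deviation puts the measure on a relative footing, so that distributions with different amounts of variability can be compared.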


The Beta Coefficients. This last measure of skewness is of
particular interest, not only because it is based on a wholly
mathematical procedure (it is not dependent on nonmathematical
summaries like the median and quartiles or the mode),
but also because it is directly related to one of the so-called
“beta coefficients.” The beta coefficients are functions of the
moments of the frequency distribution that have been found
very useful in describing and distinguishing various types of
frequency distributions.¹ The two principal beta coefficients
are $\beta_1$ and $\beta_2$, which are defined as follows:

$$\beta_1 = \frac{\mu_3^2}{\mu_2^3} \qquad \beta_2 = \frac{\mu_4}{\mu_2^2}$$

It will be noted that the sixth root of $\beta_1$ is identically the
coefficient of skewness, $\mathrm{sk} = \mu_3^{1/3}/\sigma$, for $\mu_2 = \sigma^2$. Frequently
$\sqrt{\beta_1}$ is itself taken as a measure of skewness, the square root
being given the same sign as $\sum Fx^3$ from which the third moment
is calculated.


1 Smith and Duncan, Sampling Statistics.
When the beta coefficients are used to describe a frequency 
curve according to a formula developed by Karl Pearson,' the 


° x X-—Mo., 
skewness for this eurve as measured by — eme d found to be 
equal to 


xo MB (Bs + 3) E 
sk = 30656, — 68; — 9) ES 


When the data are such as to warrant the fitting of a smooth
frequency curve, this is an excellent formula for measuring
skewness. The curve, of course, does not have to be fitted to
make use of the formula as a measure of skewness.

When $\beta_1$ is small, i.e., when the skewness is slight, and when
$\beta_2$ is approximately equal to 3, as it is in the case of a normal
distribution,² this last equation shows that $\sqrt{\beta_1}$ is
approximately equal to twice $\frac{\bar{X} - Mo}{\sigma}$, the mode being that
of the fitted Pearsonian curve. When the latter is calculated by the
approximate formula $Mo = \bar{X} - 3(\bar{X} - Mi)$, the calculation of half
the square root of $\beta_1$ will in certain cases serve as a rough check
on the computation of skewness from the formula
$sk = \frac{\bar{X} - Mo}{\sigma}$.

The importance of the second beta coefficient lies in the fact
that it is a measure of kurtosis.
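The relations among the beta coefficients and the skewness measures can be verified numerically. The sketch below is illustrative only: the function names are hypothetical, and the value $\Sigma Fx^4 = 574$ is computed here from the positively skewed example table rather than quoted from the text.

```python
def beta_coefficients(mu2, mu3, mu4):
    """beta1 = mu3^2 / mu2^3 and beta2 = mu4 / mu2^2."""
    return mu3**2 / mu2**3, mu4 / mu2**2

def pearson_skewness(beta1, beta2, sign=1):
    """Skewness of the fitted Pearsonian curve; sign is that of the third moment."""
    return sign * beta1**0.5 * (beta2 + 3) / (2 * (5 * beta2 - 6 * beta1 - 9))

# Moments of the positively skewed example (N = 34).
mu2, mu3, mu4 = 82 / 34, 90 / 34, 574 / 34
beta1, beta2 = beta_coefficients(mu2, mu3, mu4)

# The sixth root of beta1 is the cube-root coefficient of skewness.
sixth_root = beta1 ** (1 / 6)
coefficient = mu3 ** (1 / 3) / mu2**0.5

# With beta2 = 3 and a small beta1, the formula gives nearly half of sqrt(beta1).
nearly_normal = pearson_skewness(0.0001, 3.0)
print(round(beta1, 4), round(beta2, 4), round(sixth_root, 4), nearly_normal)
```

The last value illustrates the rough check described above: for a nearly normal distribution, half the square root of $\beta_1$ and the Pearsonian skewness almost coincide.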


KURTOSIS 
Definition. Kurtosis is described by Karl Pearson as follows:³


Given two frequency distributions which have the same variability
as measured by the standard deviation, they may be relatively more or
less flat-topped than the normal curve. . . . A frequency distribution
may, in other words, be symmetrical, but it may fail to be mesokurtic
(equally flat-topped with the normal curve), and thus the Gaussian
curve cannot describe it.


¹ Cf. SMITH and DUNCAN, Sampling Statistics, pp. 134-137.

² See the next section.

³ “Skew Variation, A Rejoinder,” Biometrika, Vol. 4 (1906), p. 173.


Cited from H. M. Walker, Studies in the History of Statistical Method, p. 182.

The “normal curve” to which this quotation refers is represented
by the equation

$$Y = \frac{N}{\sigma\sqrt{2\pi}}\, e^{-\frac{(X - \bar{X})^2}{2\sigma^2}} \tag{18}$$

Because this curve has arisen so frequently in statistics and
because it has been used as a type with which to compare other
frequency curves, it has come to be known as the normal curve.
Also, since Gauss early recognized its importance, it is sometimes
called the Gaussian curve.

FIG. 70.—Frequency distributions with greater and with less kurtosis
than the normal curve.

As shown in graphic form earlier in this chapter (Fig. 63),
when there is a marked concentration of very small variations
about the central tendency, the frequency curve rises to a high
peak, unlike the normal, or Gaussian, curve, which has a certain
roundness at the top. Kurtosis is a measure that makes it
possible to describe the relative degree to which this characteristic
exists with reference to any frequency distribution. For
the normal curve, the relationship between the second and the
fourth moments is as follows:

$$\frac{\mu_4}{\mu_2^2} = 3$$

¹ Cf. pp. 294-295.


If the ratio of the fourth moment to the square of the second
moment is less than 3, the curve is flatter than the normal curve;
and if the ratio is greater than 3, the curve is more peaked than
the normal curve. Figure 70 shows three frequency distributions,
with $\beta_2$ equal, respectively, to 2, 3, and 4; the standard
deviations of the three curves are equal. It should be noted
that the smaller area at the peak of the flat-topped distribution
is accompanied by a loss of area at the tails of the distribution.
The loss of area at the peak and the tails is compensated by the
fact that the curve is higher than the normal curve on each side

TABLE 10.—COMPUTATION OF THREE FREQUENCY CURVES¹

 (1)  |  (2)   |   (3)   |   (4)   |   (5)   |  (6)   |  (7)
      | $\beta_2 = 3$ |         |         |         | $\beta_2 = 4$ | $\beta_2 = 2$
 0.0  | 0.3989 |  1.1968 |  0.0499 | −0.0499 | 0.4488 | 0.3490
 0.2  | 0.3910 |  1.0799 |  0.0450 | −0.0450 | 0.4360 | 0.3460
 0.4  | 0.3683 |  0.7607 |  0.0317 | −0.0317 | 0.4000 | 0.3366
 0.6  | 0.3332 |  0.3231 |  0.0135 | −0.0135 | 0.3467 | 0.3197
 0.8  | 0.2897 | −0.1247 | −0.0052 |  0.0052 | 0.2845 | 0.2949
 1.0  | 0.2420 | −0.4839 | −0.0202 |  0.0202 | 0.2218 | 0.2622
 1.2  | 0.1942 | −0.6925 | −0.0288 |  0.0288 | 0.1654 | 0.2230
 1.4  | 0.1497 | −0.7364 | −0.0307 |  0.0307 | 0.1190 | 0.1804
 1.6  | 0.1109 | −0.6440 | −0.0268 |  0.0268 | 0.0841 | 0.1377
 1.8  | 0.0790 | −0.4692 | −0.0195 |  0.0195 | 0.0595 | 0.0985
 2.0  | 0.0540 | −0.2700 | −0.0112 |  0.0112 | 0.0428 | 0.0652
 2.2  | 0.0355 | −0.0927 | −0.0039 |  0.0039 | 0.0316 | 0.0394
 2.4  | 0.0224 |  0.0362 |  0.0015 | −0.0015 | 0.0239 | 0.0209
 2.6  | 0.0136 |  0.1105 |  0.0046 | −0.0046 | 0.0182 | 0.0090
 2.8  | 0.0079 |  0.1370 |  0.0057 | −0.0057 | 0.0136 | 0.0022


¹ Columns (2), (6), and (7) give the ordinates of the three curves.
Column (6) = (2) + (4), and column (7) = (2) + (5), account being taken
of the signs. The figures are derived from the formula for a
Gram-Charlier frequency curve. See SMITH and DUNCAN, Sampling
Statistics, pp. 84, 142-144.

between the peak and the tails. In other words, it is flat both
at the peak and at the tails and is high in the shoulders. In 
contrast, the more peaked frequency curve is higher than the 
normal curve at both the peak and the tails and lower than the 
normal curve on each side between the peak and the tails. Its 
shoulders are lower than those of the normal curve. 
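The ordinates in Table 10 can be reproduced approximately with a short sketch. It assumes the usual kurtosis-only Gram-Charlier form, in which the normal ordinate is multiplied by $1 + \frac{\beta_2 - 3}{24}(t^4 - 6t^2 + 3)$ with $t$ in standard-deviation units; the table note names the Gram-Charlier curve but does not print the formula, so this form and the function names are assumptions.

```python
import math

def normal_ordinate(t):
    """Ordinate of the normal curve at t standard deviations from the mean."""
    return math.exp(-t * t / 2) / math.sqrt(2 * math.pi)

def kurtosis_adjusted_ordinate(t, beta2):
    """Normal ordinate with an assumed Gram-Charlier correction for kurtosis."""
    correction = (beta2 - 3) / 24 * (t**4 - 6 * t**2 + 3)
    return normal_ordinate(t) * (1 + correction)

# Peak, shoulder, and tail for the flat (beta2 = 2) and peaked (beta2 = 4) curves.
for t in (0.0, 1.4, 2.8):
    flat = kurtosis_adjusted_ordinate(t, 2)
    normal = kurtosis_adjusted_ordinate(t, 3)
    peaked = kurtosis_adjusted_ordinate(t, 4)
    print(f"{t:.1f}  {flat:.4f}  {normal:.4f}  {peaked:.4f}")
```

At the peak and in the tails the peaked curve lies above the normal curve and the flat curve below it; at the shoulders (near $t = 1.4$) the order is reversed, as the text describes.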


SUMMARY 


There are two ways in which frequency distributions differ
from the so-called “normal” frequency distribution.¹ (1) Frequency
distributions may have a higher or lower peak than the
normal frequency distribution. This relative flatness or lack of
flatness of the peak relative to the normal curve is called
“kurtosis.” (2) Frequency distributions may have a preponderance of
variability to the large values or to the small values. This lack
of symmetry in variability is called “skewness.” The normal
distribution and the concepts connected with its analysis
constitute a convenient point of departure for the general analysis
of variability. In this study of variability, the characteristics of
kurtosis and skewness are of great importance for the reason that
a large part of the phenomena studied have characteristics
producing frequency distributions that are not normal. “Even as
early as the time when Sir Francis Galton was developing his
theory of correlation (1877-1889), writers on mathematical
statistics realized that the univariate normal law of De Moivre
and Laplace could not be regarded as a universal law of frequency
distribution; the presence of skewness in homogeneous
material was certainly as common as that of normality.”²

It is important to realize that the function of frequency- 
distribution analysis is not primarily to define and measure 
averages but to define, describe, and measure variability. 
Simple averages have relatively limited uses and may lead to mis- 
interpretation rather than clarification if used without refer- 
ence to the measures of variability, skewness, and kurtosis. 


1 For further description of the normal curve, see Chaps. X and XI. 

² PRETORIUS, S. J., Biometrika, Vol. 22 (1930-1931), pp. 109-223. Cf.
ELDERTON, W. PALIN, Frequency Curves and Correlation (1927). RIETZ,
H. L., Handbook of Mathematical Statistics, Chap. VII, Frequency Curves,
pp. 92-119, by H. C. Carver, and pp. 288-239 by W. A. Shewhart.

In this chapter some of the most generally used methods of
analysis of frequency distributions have been presented in
elementary form, in order to keep clear of the complications of
practical application. It will now be easier to see how these
methods are applied and just what complications enter into their
application to real problems. The following chapter gives an
example of such an analysis, using data from real life. But it will
be well to close this chapter with a summary of the symbols that
have thus far been used and that include by far the majority of
all the symbols used in statistics. Most of the symbolic language
can be learned from this chapter. If these symbols are mastered,
the few additional ones will be easy to learn.


SUMMARY OF SYMBOLS

Variable                      $X_1, X_2, X_3, \ldots, X_n$ or in general $X$
Frequencies                   $F_1, F_2, F_3, \ldots, F_n$ or in general $F$
Sum of                        $\Sigma$ (Greek capital sigma)
Number of cases               $N$ (equals $\Sigma F$)
Arithmetic mean               $\bar{X}$
Deviations from the mean      $x_1, x_2, x_3, \ldots, x_n$ or in general
                              $x$ (equals $X - \bar{X}$)
Median                        Mi
First quartile                $Q_1$
Third quartile                $Q_3$
Mode                          Mo
Deviations from the median    $x'_1, x'_2, x'_3, \ldots, x'_n$ or in general
                              $x'$ (equals $X - Mi$)
Geometric mean                G.M.
Harmonic mean                 H.M.
Average deviation             A.D.
Standard deviation            $\sigma$ (Greek small sigma)
Chi square                    $\chi^2$ (Greek small chi)
Skewness                      $sk$
Moments:
  a. About arbitrary origin:  $\nu_1, \nu_2, \nu_3, \ldots$ or $\nu_n$
  b. About the arithmetic mean:  $\mu_1, \mu_2, \mu_3, \ldots$ or $\mu_n$
Kurtosis                      $\beta_2$ (Greek small beta)


Following is a summary of the formulas that have been used in
this chapter:


SUMMARY OF FORMULAS

(1) $\bar{X} = \frac{\Sigma FX}{N}$; $x = X - \bar{X}$
(2) $\Sigma Fx = 0$
(3) $Mo = \bar{X} - 3(\bar{X} - Mi)$
(slog G.M. = XE 
Xi Xs Gabi Xn E 
©) aw. X G.M. X LOMA 
CuM A 
bol ah 
Sy 
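(7) H.M. < G.M. < $\bar{X}$
(8) $\nu_1 = \frac{\Sigma FX}{N}$, $\nu_2 = \frac{\Sigma FX^2}{N}$, $\nu_3 = \frac{\Sigma FX^3}{N}$, . . . , $\nu_n = \frac{\Sigma FX^n}{N}$
(9) $\mu_1 = \frac{\Sigma Fx}{N}$, $\mu_2 = \frac{\Sigma Fx^2}{N}$, $\mu_3 = \frac{\Sigma Fx^3}{N}$, . . . , $\mu_n = \frac{\Sigma Fx^n}{N}$
(10) $\mu_0 = 1$, $\mu_1 = 0$, $\mu_2 = \sigma^2$
(11) $\mu_3 = 0$, $\mu_5 = 0$, $\mu_7 = 0$, . . . in a symmetrical distribution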
SIF x|Fa' 
(12) A.D. = zd I zm 
(13) $\sigma = \sqrt{\frac{\Sigma Fx^2}{N}} = \sqrt{\mu_2}$
(14) $sk = \frac{\bar{X} - Mo}{\sigma} = \frac{3(\bar{X} - Mi)}{\sigma}$
E 
Q 
(16) $sk = \sqrt[3]{\mu_3}$
(17) $sk = \frac{\sqrt{\beta_1}\,(\beta_2 + 3)}{2(5\beta_2 - 6\beta_1 - 9)}$
(x— X>: 


Wies c AE 
c T 


CHAPTER VII 


ILLUSTRATION OF FREQUENCY-DISTRIBUTION 
ANALYSIS 


Data for a Frequency Distribution. Data selected to illustrate 
frequency distribution analysis are presented in Table 11. 
Heights in inches of 300 members of the freshman class of 1943 
were obtained from the records of the Department of Health and 
Physical Education, Princeton University. As presented in the
table the data are not arranged in a frequency distribution; 
they are listed at random. In order to make a frequency dis- 
tribution of the data it is necessary first to decide on the size and 
limits of the class interval to be used in the construction of the 
distribution; for the frequency distribution per se is a method of 
summarization compared with the manner of presentation of the 
data in Table 11. 

CONSTRUCTION OF A FREQUENCY DISTRIBUTION 


The Class Interval. Rules for Determining the Class Interval. 
The class interval is the unit of the frequency distribution; in 
other words, it is the size of the groups in which the data are 
summarized. In the data selected for illustration should the
groups be 1-inch, ½-inch, ¼-inch, or 1/10-inch groups? That is,
should the class interval be 1 inch, a half inch, a quarter inch, or a 
tenth of an inch? 

A general rule for selecting the class interval is that it should be 
such as to make possible, without serious error, the treatment of 
all values assigned to any one of the classes as if they were equal 
to the mid-point or mid-value of the class. The lower limit of the 
class intervals also should be so selected as to facilitate this end. 
If the cases are concentrated, in fact, at the mid-point of the class 
interval or are evenly distributed throughout, it may, without 
serious errors in calculation, be assumed that all cases in the class 
are equal in value to the mid-value. 

Another guide in the selection of the class interval is that it 
should be as large as possible subject to the first condition and
to the condition that the interval should not be so large as to
conceal too much of the character of the variability. Indeed, the
most important purpose of the class interval is so to summarize


TABLE 11.—HEIGHTS, 300 EIGHTEEN-YEAR-OLD MEMBERS OF THE CLASS OF
1943, PRINCETON UNIVERSITY
(In inches)


I 
71.00 76.50 68.75 74.40 70.50 
71.50 | 69.00 69.50 73.00 | 67.50 | 
70.00 | 67.50 70.00 69.80 | 
67.50 75.00 70.75 63.00 | 
72,50 69.00 67.00 70.28 | 

70.50 70.50 69.70 66.00 

72.00 70.30 13.20 69.00 

2.25 72.60 71.20 70.50 | 

67.25 72.80 68.00 70.50 | 

71.00 71.50 73.50 68.00 67.50 
72.25 70.75 67.75 70.50 72.00 
60.50 75.75 69.50 | 68. | 
78.75 65.50 73.25 70. | 
69.70 70.00 71.00 | 72. 
75.75 70.50 70.75 | 71,00 | 
71.50 72.75 69.20 69.50 | 71.00 
72.00 67.50 73.20 68.50 | 70.00 
67.00 68.75 72.00 71.00 
71.00 69.50 70.75 72.00 | 
65.75 73.00 70.75 68.20 
75.25 71.70 73.50 | 74.20 
67.25 67.50 | 67.00 72.00 
69.50 69.25 74.30 67.00 
70.25 67.00 70.25 68.70 | 
70.10 72.50 71.80 
09.20 68.75 70.10 
08.00 | 70.00 68.10 | 
71.00 | 71.25 73.00 | 
09.00 ` 72.00 71.00. | 
62.00 73.95 | 64.25 
70.00 70.75 | 70.75 
74.00 | | 372.00 | .71.20 
67.00 | 69.00 72.00 
73.50 63.60 72.50 | 
70.50 | 71.00 71.00 | 
71.25 | | 60.50 69.00 
73.00 74.00 07.50 
68.50 | 73.70 72.00 | 
68.00 | 69.50 71.75 
70.00 71.755 | 73.00 
78.20. | | | 

72.80 | | 

69.40 | | 

68.50 | | 

68.50

the data in a frequency distribution as to disclose more clearly
the character of the variability. If a very small class interval is
chosen, the character of the variability will not be visible unless a
very large number of cases are measured; if a very large class
interval is chosen, significant irregularities in the data may be
concealed.

FIG. 71.—Array of the data of Table 11 and a graph of the row totals.
(The latter is also a graph of Table 14.)

Ordinarily, the size of the class interval should be uniform
throughout, because different-sized class intervals will compli- 
cate calculations. In some cases, however, it is necessary to 
use different-sized class intervals in order to give a proper picture 
of variability. 

If other more important rules are not thereby violated, in the 
interests of simplicity the position of the class interval in the 
range should be such that the limits of the intervals are integers 
or such that the mid-values of the class intervals are integers. 
Where marked concentration about certain values exists, as is 
sometimes the case in dealing with discrete data, these values 
should so far as possible be made the mid-points of class intervals. 

An Array of the Data. Intelligent determination of the class 
interval is aided by study of the data arranged in an array or 
scatter diagram such as Fig. 71, which is presented to illustrate 
the determination of the proper class interval. In the figure 
the heights shown in Table 11 are arranged in an array. Because 
inspection of the data in Table 11 led to the suspicion that
concentration points were present at the ¼-, ½-, and ¾-inch values,
the array is presented in rows with these concentration points 
plumbed. Summing the columns as well as inspection of the 
detail of the scatter diagram show the concentration of fre- 
quencies at these values. 

Frequency Distribution with Too Many Class Intervals. Examination
of Fig. 71 suggests that a ¼-inch class interval beginning
at 61.875 inches, as shown in Table 12, might be a good class
interval for the data of Table 11, for the ¼-inch class interval
with the lower limits as shown in Table 12 places the mid-values
of class intervals at points of concentration. Such a frequency
distribution contains over 60 rows, however, and, in addition, 
is uneven and irregular in appearance. Ten frequencies occur 
in the interval 66.875-; only five frequencies occur in the interval 
68.625-; twelve frequencies occur in the intervals immediately 
below and above 68.625-. Moreover, it is not clear whether 
the modal class interval is 693, 704, 70%, 71, or 72 inches; because 
an equal number (15) have each of these five heights. 

The ¼-inch class interval is too small in this instance to
disclose clearly the nature of variation in freshman heights.

A Larger Class Interval Reveals the Character of Variation. If 
1 inch is taken as the class interval, the frequency distribution
will contain 17 classes and will appear as shown in Table 13.
In this frequency distribution the lower limits of the class
intervals are so chosen that mid-values are at the 0.625-inch points
(⅝ inch), which is a balancing center of the concentration points
at the ¼-inch intervals because at ⅝ inch each mid-value has two
¼-inch concentration points below it and two above it in the
1-inch class interval. This balancing position of the ⅝-inch
points can be readily seen by an examination of Fig. 71.

TABLE 12.—FREQUENCY DISTRIBUTION OF THE HEIGHTS OF 300 PRINCETON
FRESHMEN, CLASS OF 1943

Heights of freshmen, $X$ (Interval | Mid-value) | Number of freshmen having specified heights, $F$

FIG. 72.—Distribution of heights of 300 Princeton freshmen. (Class
interval = ¼ inch.) Horizontal axis: height, inches.

FIG. 73.—Distribution of heights of 300 Princeton freshmen. (Class
interval = 1 inch.) Horizontal axis: height, inches.

In order to contrast the irregularities in the frequency
distribution using too small a class interval with the regular
appearance of the same frequency distribution using a larger class
interval, Figs. 72 and 73 are presented. Figure 72 is a graph
of the frequency distribution of heights of 300 Princeton freshmen,
using a ¼-inch class interval. Figure 73 is a graph of the
frequency distribution of heights of 300 Princeton freshmen,
using a 1-inch class interval.

The argument for a class interval centered at the ⅝-inch point
has been based on the assumption that measurements have been
made to the nearest ¼ inch. In other words, a height recorded
as 64.25 might be anything between 64.125 and 64.375. If
measurements were always made to the lowest ¼ inch, then some
other mid-point would be warranted such as the ½-inch points,
or integral values. Table 14 is one based on this assumption.
Since the exact method of measurement is not known and since
Table 14 is simplest in form, it is adopted for subsequent analysis.
A graph of the distribution has already been shown in Fig. 71.

In frequency Tables 12 to 14, the class interval has been listed
in two ways. (1) It has been described by writing on each line
the lower limit of the class interval, followed by a dash. (2) It

TABLE 13.—FREQUENCY DISTRIBUTION OF THE HEIGHTS OF 300 PRINCETON
FRESHMEN, CLASS OF 1943

  Heights of freshmen, $X$    |  Number of freshmen having
  Interval   |  Mid-value     |  specified heights, $F$

  61.125-    |  61.625        |    1
  62.125-    |  62.625        |    1
  63.125-    |  63.625        |    1
  64.125-    |  64.625        |    2
  65.125-    |  65.625        |    4
  66.125-    |  66.625        |   20
  67.125-    |  67.625        |   30
  68.125-    |  68.625        |   35
  69.125-    |  69.625        |   47
  70.125-    |  70.625        |   51
  71.125-    |  71.625        |   42
  72.125-    |  72.625        |   27
  73.125-    |  73.625        |   20
  74.125-    |  74.625        |    9
  75.125-    |  75.625        |    7
  76.125-    |  76.625        |    2
  77.125-    |  77.625        |    1


TABLE 14.—FREQUENCY DISTRIBUTION OF THE HEIGHTS OF 300 PRINCETON
FRESHMEN, CLASS OF 1943

  Heights of freshmen, $X$    |  Number of freshmen having
  Interval   |  Mid-value     |  specified heights, $F$

  62-        |  62.5          |    1
  63-        |  63.5          |    2
  64-        |  64.5          |    1
  65-        |  65.5          |    4
  66-        |  66.5          |   12
  67-        |  67.5          |   31
  68-        |  68.5          |   31
  69-        |  69.5          |   47
  70-        |  70.5          |   48
  71-        |  71.5          |   42
  72-        |  72.5          |   35
  73-        |  73.5          |   21
  74-        |  74.5          |   14
  75-        |  75.5          |    8
  76-        |  76.5          |    2
  77-        |  77.5          |    1
                              |  300


has been described by writing in the next column the mid-value.
Obviously, both methods of describing the class interval need
not always be employed; the conventional procedure is to use
the lower-limit description rather than the mid-value description.
The mid-value can always be calculated by adding one-half
the class interval to the lower limit of the class interval.
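The construction of a frequency distribution from ungrouped data can be sketched in a few lines. This is an illustration under stated assumptions, not the procedure used to prepare the tables: the function name is hypothetical, a value falling exactly on a lower limit is assigned to the class that begins there, and the sample is only the first five legible values of Table 11.

```python
import math
from collections import Counter

def frequency_distribution(values, lower_limit, interval):
    """Return rows of (lower limit, mid-value, frequency) for equal class intervals."""
    counts = Counter()
    for v in values:
        # Index of the class whose lower limit is at or just below v.
        counts[math.floor((v - lower_limit) / interval)] += 1
    rows = []
    for k in range(min(counts), max(counts) + 1):
        lower = lower_limit + k * interval
        # Mid-value = lower limit plus one-half the class interval.
        rows.append((lower, lower + interval / 2, counts.get(k, 0)))
    return rows

sample = [71.00, 76.50, 68.75, 74.40, 70.50]
one_inch = frequency_distribution(sample, lower_limit=62, interval=1)
quarter_inch = frequency_distribution([71.00], lower_limit=61.875, interval=0.25)
for row in one_inch:
    print(row)
```

With the ¼-inch interval beginning at 61.875, the recorded height 71.00 falls in the class 70.875- whose mid-value is 71.00, which is how that choice of limits places mid-values on the points of concentration.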




WORK SHEET FOR FREQUENCY-DISTRIBUTION ANALYSIS 


The frequency distribution having been constructed, the 
procedure for frequency-distribution analysis will now be 
described. Table 15 is a work sheet for the analysis of a fre- 
quency distribution; in columns (1) and (2), under X and F, 
is copied the frequency distribution from Table 14. Entries 
in the remaining columns will be explained below. The work 
sheet is so constructed that advantage may be taken of certain 
economies in calculation, These economies arise from two 
sources: (1) the reduction in calculation, due to the use of a short 
method that involves the calculation of the moments about an 
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. 
“arbitrary origin" and (2) a reduction in calculation, due to 
the use of class intervals as units of deviation from the arbitrary 
origin. 

Saving Calculation by Obtaining Moments about an Arbitrary
Origin. In applying the short method, an arbitrary origin,
which may be called $A$, is selected. While zero may be taken
as an arbitrary origin (and often is in certain statistical problems),
in the analysis of frequency distributions the amount of calculation
is reduced by selecting a value for $A$ somewhere near the
middle of the range. The moments about the arbitrary origin
are then calculated by measuring deviations from $A$ in
class-interval units, that is, in $d/i$'s. Sometimes $d'$ is used to
symbolize $d/i$. The savings in calculation are due to the fact that
all desired mathematical statistics can then be computed by the use of
formulas from the four moments about the arbitrary origin.

Saving Calculation by Using Class-interval Units. Saving in
the amount of calculation to obtain the various statistics results
if the class-interval unit is used, particularly if the variable is in
complex or fractional units or in large numbers. This economy
is brought about by expressing the deviations in terms of class
intervals instead of in original units, i.e., in $d/i$'s instead of in
$d$'s. As pointed out above, this saving is augmented by selecting
the arbitrary origin near the middle of the frequency distribution.
If the arbitrary origin is at or near the middle class interval, the
largest deviation in terms of class-interval units will then be no
greater than half the number of class intervals in the frequency
distribution. Since the deviations must be raised to the fourth
power in order to calculate the fourth moment, substantial saving
in calculation is secured by keeping class-interval deviations as
small as possible by placing the arbitrary origin near the middle
of the frequency distribution. It will be observed in Table 15
that the frequency distribution has been copied on the work
sheet in such a position that the arbitrary origin is near the middle
of the frequency distribution. It can also be seen that, when
the class interval is uniform in size, recording the class-interval
deviations in column (3) is merely a matter of proceeding by
count above and below the arbitrary origin, that is, −1, −2,
−3, etc., for successive smaller class-interval values, and 1, 2, 3,
etc., for successive larger class-interval values.

Entering the Frequency Distribution on the Work Sheet. The
frequency distribution of freshmen heights shown in Table 14
has 16 class intervals; and if the mid-value of the central class
interval is to be selected as the arbitrary origin, the first class
interval, 62-, will be entered in column (1) under “Interval”
on the line opposite −7 in column (3) ($d/i = -7$). The remaining
class intervals will be entered in succeeding lines until 77-
will be opposite 8 in column (3) ($d/i = 8$). The mid-value
of the central class interval is 69.5, which is opposite 0 in column
(3) ($d/i = 0$). The corresponding frequencies are then entered
in column (2). Full description of the data and their source
is entered in the space provided at the top of the work sheet.
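The class-interval deviations of column (3) follow directly from the mid-values, the arbitrary origin, and the class interval. A minimal sketch, assuming the mid-values of Table 14 and a hypothetical helper name; rounding is used only because the mid-values are assumed to lie a whole number of class intervals from the origin.

```python
def class_interval_deviations(mid_values, origin, interval):
    """Deviations from the arbitrary origin, expressed in class-interval units."""
    return [round((x - origin) / interval) for x in mid_values]

# Mid-values of Table 14: 62.5, 63.5, ..., 77.5 (16 class intervals).
mid_values = [62.5 + k for k in range(16)]
d_over_i = class_interval_deviations(mid_values, origin=69.5, interval=1)
print(d_over_i)
```

The largest deviation in absolute value is 8, half the number of class intervals, which is the bound the text gives for an origin placed near the middle.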
Saving Calculation in Use of Work Sheet. The amount of
calculation involved in the entries required for columns (4) to
(9) can be reduced to a minimum by the following procedure:

In column (4), headed $F(d/i)$, enter the class-interval deviations
multiplied by the frequencies [i.e., items in column (3)
multiplied, respectively, by items in column (2)]. The algebraic
sum of the figures in column (4), divided by $N$, equals the
first moment (in class-interval units) about the arbitrary
origin.

The figures in column (5), headed $F(d/i)^2$, are obtained by
multiplying the items in column (4), respectively, by the
corresponding items in column (3). The sum of figures in column
(5), divided by $N$, equals the second moment (in class-interval
units) about the arbitrary origin.

The figures in column (6), headed $F(d/i)^3$, are most easily
obtained by multiplying the items in column (5), respectively,
by the corresponding items in column (3). The algebraic sum
of figures in column (6), divided by $N$, equals the third moment
(in class-interval units) about the arbitrary origin.

The figures in column (7), headed $F(d/i)^4$, are obtained by
multiplying the items in column (6), respectively, by the
corresponding items in column (3). The sum of figures in column
(7), divided by $N$, equals the fourth moment (in class-interval
units) about the arbitrary origin.

The figures in column (8), headed $(d/i + 1)^4$, are obtained
by adding 1, respectively, to each figure in column (3) and raising
the result to its fourth power. All figures in this column are
readily obtained by using a table of powers of numbers.¹ The
sum of column (8) is not used.

The figures in column (9), headed $F(d/i + 1)^4$, are obtained
by multiplying the items in column (8), respectively, by
corresponding items in column (2). The sum of column (9) is used
to check the arithmetical accuracy of all calculations in the
work sheet.

When the work sheet is completed, it will show the following
values:

$$A,\quad i,\quad N,\quad \Sigma F\left(\frac{d}{i}\right),\quad \Sigma F\left(\frac{d}{i}\right)^2,\quad \Sigma F\left(\frac{d}{i}\right)^3,\quad \text{and}\quad \Sigma F\left(\frac{d}{i}\right)^4$$

In addition, by means of columns (8) and (9), the work sheet
provides a cross check on its internal calculations, since the
expansion of $\Sigma F\left(\frac{d}{i} + 1\right)^4$ gives the following terms:

$$\Sigma F\left(\frac{d}{i}\right)^4 + 4\,\Sigma F\left(\frac{d}{i}\right)^3 + 6\,\Sigma F\left(\frac{d}{i}\right)^2 + 4\,\Sigma F\left(\frac{d}{i}\right) + \Sigma F$$

It follows that on a correctly constructed work sheet the sum
of column (9) equals the sum of column (7) plus four times the
sum of column (6) plus six times the sum of column (5) plus four
times the sum of column (4) plus the sum of column (2). This
is called a “Charlier check” after the name of the man who first
suggested its use as a checking device.

For Table 15 the Charlier check is as follows:


    $\Sigma$ [column (2)]                  =    300
    $4\Sigma$ [column (4)] = 4 × 292       =  1,168
    $6\Sigma$ [column (5)] = 6 × 2,140     = 12,840
    $4\Sigma$ [column (6)] = 4 × 5,590     = 22,360
    $\Sigma$ [column (7)]                  = 45,088

    Sum = $\Sigma$ [column (9)]            = 81,756
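The work-sheet totals and the Charlier check can be reproduced mechanically. The following sketch assumes the frequencies of Table 14 and the class-interval deviations −7 to 8 described above; the helper name `column_sum` is hypothetical.

```python
# Frequencies from Table 14 and class-interval deviations about A = 69.5.
freqs = [1, 2, 1, 4, 12, 31, 31, 47, 48, 42, 35, 21, 14, 8, 2, 1]
devs = list(range(-7, 9))

def column_sum(power, shift=0):
    """Sum of F * (d/i + shift) ** power over all class intervals."""
    return sum(f * (d + shift) ** power for f, d in zip(freqs, devs))

n = column_sum(0)                                  # column (2): total frequency
sums = {p: column_sum(p) for p in (1, 2, 3, 4)}    # columns (4) to (7)
check = column_sum(4, shift=1)                     # column (9)

# Charlier check: binomial expansion of (d/i + 1)^4, weighted by F.
charlier = n + 4 * sums[1] + 6 * sums[2] + 4 * sums[3] + sums[4]
print(n, sums, check, charlier)
```

Agreement between `check` and `charlier` confirms the column totals jointly; it would not reveal two errors that happen to cancel.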


¹ Cf. Mathematical Tables from Handbook of Chemistry and Physics, pp.
153-173. For use in making calculations there are a number of convenient
devices such as the slide rule and calculating machines, as well as logarithms.
There are also several useful printed tables such as Barlow's Tables of
Squares, Cubes, Square-roots, Cube-roots, and Reciprocals of Integers up to
10,000 and Karl Pearson's Tables for Statisticians and Biometricians. A set
of logarithms will be found in Appendix, Table I.

TABLE 15.—WORK SHEET FOR MAKING CALCULATIONS IN THE ANALYSIS
OF A FREQUENCY DISTRIBUTION
DESCRIPTION OF DATA: Heights of 300 Princeton University Freshmen,
Class of 1943
SOURCE OF DATA: Princeton University's Department of Health and
Physical Education

$i$ = 1 in.
$A$ = 69.5 in. (Mid-point of class interval near center of distribution)

       (1)            | (2) | (3)   | (4)      | (5)        | (6)        | (7)        | (8)           | (9*)
 Interval | Mid-point | $F$ | $d/i$ | $F(d/i)$ | $F(d/i)^2$ | $F(d/i)^3$ | $F(d/i)^4$ | $(d/i + 1)^4$ | $F(d/i + 1)^4$
          |           |     | −12   |          |            |            |            |               |
          |           |     | −11   |          |            |            |            |               |
          |           |     | −10   |          |            |            |            |               |
          |           |     |  −9   |          |            |            |            |               |
          |           |     |  −8   |          |            |            |            |               |
 62-      |           |   1 |  −7   |    −7    |     49     |    −343    |    2,401   |     1,296     |   1,296
 63-      |           |   2 |  −6   |   −12    |     72     |    −432    |    2,592   |       625     |   1,250
 64-      |           |   1 |  −5   |    −5    |     25     |    −125    |      625   |       256     |     256
 65-      |           |   4 |  −4   |   −16    |     64     |    −256    |    1,024   |        81     |     324
 66-      |           |  12 |  −3   |   −36    |    108     |    −324    |      972   |        16     |     192
 67-      |           |  31 |  −2   |   −62    |    124     |    −248    |      496   |         1     |      31
 68-      |           |  31 |  −1   |   −31    |     31     |     −31    |       31   |         0     |       0
 69-      |   69.5    |  47 |   0   |     0    |      0     |       0    |        0   |         1     |      47
 70-      |           |  48 |   1   |    48    |     48     |      48    |       48   |        16     |     768
 71-      |           |  42 |   2   |    84    |    168     |     336    |      672   |        81     |   3,402
 72-      |           |  35 |   3   |   105    |    315     |     945    |    2,835   |       256     |   8,960
 73-      |           |  21 |   4   |    84    |    336     |   1,344    |    5,376   |       625     |  13,125
 74-      |           |  14 |   5   |    70    |    350     |   1,750    |    8,750   |     1,296     |  18,144
 75-      |           |   8 |   6   |    48    |    288     |   1,728    |   10,368   |     2,401     |  19,208
 76-      |           |   2 |   7   |    14    |     98     |     686    |    4,802   |     4,096     |   8,192
 77-      |           |   1 |   8   |     8    |     64     |     512    |    4,096   |     6,561     |   6,561
          |           |     |   9   |          |            |            |            |               |
          |           |     |  10   |          |            |            |            |               |
          |           |     |  11   |          |            |            |            |               |
          |           |     |  12   |          |            |            |            |               |
 $\Sigma$ |           | 300 |       |   292    |  2,140     |   5,590    |   45,088   |               |  81,756


* Columns (8) and (9) are for checking purposes. $\Sigma$[column (9)] = $\Sigma$[column (2)] +
$4\Sigma$[column (4)] + $6\Sigma$[column (5)] + $4\Sigma$[column (6)] + $\Sigma$[column (7)].

Moments about the Arbitrary Origin. The moments about an
arbitrary origin can be quickly calculated from the sums of
columns (4) to (7), because by definition the moments about an
arbitrary origin are as follows:

$$\begin{aligned}
\nu_1 &= \frac{\Sigma Fd}{N} \\
\nu_2 &= \frac{\Sigma Fd^2}{N} \\
\nu_3 &= \frac{\Sigma Fd^3}{N} \\
\nu_4 &= \frac{\Sigma Fd^4}{N} \\
\nu_n &= \frac{\Sigma Fd^n}{N}
\end{aligned}$$

where $X - A = d$.

If A were zero, d would equal X; and the moments would then 
reduce to the form shown in Chap. VI. 

When, as in Table 15, the deviations have been taken in
class-interval units rather than in original units, the formulas for
the moments about an arbitrary origin would be written as follows:¹

$$\begin{aligned}
\nu'_1 &= \frac{\Sigma F(d/i)}{N} \\
\nu'_2 &= \frac{\Sigma F(d/i)^2}{N} \\
\nu'_3 &= \frac{\Sigma F(d/i)^3}{N} \\
\nu'_4 &= \frac{\Sigma F(d/i)^4}{N}
\end{aligned} \tag{1}$$

where $X - A = \frac{d}{i}(i)$, in which $i$ is the class interval.

¹ Cf. p. 181. The prime on $\nu$ means that the $\nu$ is in class-interval units;
i.e., $\nu'_1 = \nu_1/i$, $\nu'_2 = \nu_2/i^2$, etc.

Accordingly, the moments in class-interval units about an
arbitrary origin are obtained from the sums of columns (4) to (7)
of the work sheet by dividing each by $N$ [the sum of column (2)].

In Table 15, the moments about the arbitrary origin in
class-interval units are as follows:

$$\begin{aligned}
\nu'_1 &= \frac{292}{300} = 0.97333 \\
\nu'_2 &= \frac{2{,}140}{300} = 7.13333 \\
\nu'_3 &= \frac{5{,}590}{300} = 18.63333 \\
\nu'_4 &= \frac{45{,}088}{300} = 150.29333
\end{aligned}$$

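Moments about the Arithmetic Mean. When the moments
about an arbitrary origin are obtained, the moments about the
mean are obtained from the following equations:¹

$$\begin{aligned}
\mu'_1 &= \nu'_1 - \nu'_1 = 0 \\
\mu'_2 &= \nu'_2 - (\nu'_1)^2 \\
\mu'_3 &= \nu'_3 - 3\nu'_2\nu'_1 + 2(\nu'_1)^3 \\
\mu'_4 &= \nu'_4 - 4\nu'_3\nu'_1 + 6\nu'_2(\nu'_1)^2 - 3(\nu'_1)^4
\end{aligned} \tag{2}$$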

The moments about the arbitrary origin having been calculated
for the frequency distribution of freshmen heights in
Table 15, the moments about the arithmetic mean in
class-interval units may now be obtained by the use of Eqs. (2), as
follows:

$$\begin{aligned}
\mu'_1 &= 0.97333 - 0.97333 = 0 \\
\mu'_2 &= 7.13333 - (0.97333)^2 = 7.13333 - 0.94737 = 6.18596 \\
\mu'_3 &= 18.63333 - 3(7.13333)(0.97333) + 2(0.97333)^3 \\
       &= 18.63333 - 20.82924 + 1.84420 = -0.35171 \\
\mu'_4 &= 150.29333 - 4(18.63333)(0.97333) + 6(7.13333)(0.97333)^2 - 3(0.97333)^4 \\
       &= 150.29333 - 72.54552 + 40.54740 - 2.69253 = 115.60268
\end{aligned}$$
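The arithmetic above can be confirmed with a short sketch that applies Eqs. (2) to the work-sheet totals and compares the second moment with a direct calculation about the mean. The function name is hypothetical; the frequencies and deviations are those of Table 15.

```python
def moments_about_mean(nu1, nu2, nu3, nu4):
    """Convert moments about an arbitrary origin to moments about the mean (Eqs. 2)."""
    mu2 = nu2 - nu1**2
    mu3 = nu3 - 3 * nu2 * nu1 + 2 * nu1**3
    mu4 = nu4 - 4 * nu3 * nu1 + 6 * nu2 * nu1**2 - 3 * nu1**4
    return mu2, mu3, mu4

N = 300
nu = [total / N for total in (292, 2140, 5590, 45088)]  # sums of columns (4)-(7)
mu2, mu3, mu4 = moments_about_mean(*nu)

# Direct check of the second moment, taking deviations from the mean itself.
freqs = [1, 2, 1, 4, 12, 31, 31, 47, 48, 42, 35, 21, 14, 8, 2, 1]
devs = range(-7, 9)
direct_mu2 = sum(f * (d - nu[0]) ** 2 for f, d in zip(freqs, devs)) / N
print(round(mu2, 5), round(mu3, 5), round(mu4, 5), round(direct_mu2, 5))
```

Small differences in the last decimal places from the figures printed above are to be expected, because the text rounds each intermediate product to five decimals.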


Equations (2) for finding the moments about the mean from
the moments about an arbitrary origin may be proved as follows:

¹ $\mu'_1 = \mu_1/i$, $\mu'_2 = \mu_2/i^2$, $\mu'_3 = \mu_3/i^3$, etc.

Since, in Eqs. (1) for moments about an arbitrary origin,
$i(d/i) = X - A$, it follows that

$$\begin{aligned}
iF_1\left(\frac{d_1}{i}\right) &= F_1(X_1 - A) = F_1X_1 - F_1A \\
iF_2\left(\frac{d_2}{i}\right) &= F_2(X_2 - A) = F_2X_2 - F_2A \\
&\;\;\vdots \\
iF_n\left(\frac{d_n}{i}\right) &= F_n(X_n - A) = F_nX_n - F_nA
\end{aligned}$$

By adding,

(a) $\quad i\,\Sigma F\left(\frac{d}{i}\right) = \Sigma FX - NA$

since $\Sigma F = N$.

Because $A$ is a constant, $d_1, d_2, \ldots, d_n$ will vary in proportion
as $X_1, X_2, \ldots, X_n$ vary. Also, since $A$ is a constant, the
sum of the $A$'s may be written as the constant multiplied by the
total frequencies, or $NA$. If now Eq. (a) is divided by $N$,

(b) $\quad \frac{i\,\Sigma F(d/i)}{N} = \frac{\Sigma FX}{N} - A$

But, by definition,

$$\nu'_1 = \frac{\Sigma F(d/i)}{N}$$

and

$$\frac{\Sigma FX}{N} = \bar{X}, \text{ the arithmetic mean}$$

Therefore, by substitution and transposing, Eq. (b) becomes

$$\bar{X} = A + i\nu'_1 \quad \text{or} \quad \bar{X} = A + \nu_1 \tag{3}$$

Accordingly the arithmetic mean of the frequency distribution
of 300 freshmen heights shown in Table 15 is as follows:¹

$$\begin{aligned}
\bar{X} &= 69.5 + 0.97333, \text{ since } i = 1 \\
        &= 70.47 \text{ in.}
\end{aligned}$$

¹ The result of calculation is 70.47333; but since the beginning data were
significant to only two places beyond the decimal, the figures beyond .47
are not significant. The manner in which the figures are written in Table
11, which was taken from the source of the data, indicates accuracy to two
decimal places. Had the numbers been rounded off to the nearest inch,
the calculated mean would have significant figures to the nearest inch.
Nevertheless, if the value of the mean is to be used for making further
mathematical calculations to obtain other statistics, it should be carried
out to several more decimal places in order to give an accurate result to
two places in the additional statistics.

It has thus been proved that the arithmetic mean equals any
arbitrary quantity plus the first moment about that arbitrary
quantity. In other words, the arithmetic mean of a series of
magnitudes is equal to any arbitrary quantity plus the mean of the
deviations from the arbitrary quantity. From Eq. (3) and from
the fact that $d = X - A$, it follows that $A = X - d$ and that
$\bar{X} = X - d + \nu_1$. Therefore, $X - \bar{X} = d - \nu_1$, and

(c) $\quad x = d - \nu_1$

or if $d$ is in class-interval units,

$$x = \left(\frac{d}{i} - \nu'_1\right) i$$

This value for $x$ may be substituted in the equations defining the
moments about the mean, as follows:¹

$$\begin{aligned}
\mu'_1 &= \frac{\mu_1}{i} = \frac{\Sigma F\left(\frac{d}{i} - \nu'_1\right)}{N} \\
\mu'_2 &= \frac{\mu_2}{i^2} = \frac{\Sigma F\left(\frac{d}{i} - \nu'_1\right)^2}{N} \\
\mu'_3 &= \frac{\mu_3}{i^3} = \frac{\Sigma F\left(\frac{d}{i} - \nu'_1\right)^3}{N} \\
\mu'_4 &= \frac{\mu_4}{i^4} = \frac{\Sigma F\left(\frac{d}{i} - \nu'_1\right)^4}{N}
\end{aligned} \tag{4}$$
1 For definition of moments about the mean, ef., p. 181. 
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After expanding and collecting like terms, these equations become 


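$$
\begin{aligned}
\mu_1 &= \frac{\sum F(d/i)}{N} - \nu_1 \\
\mu_2 &= \frac{\sum F(d/i)^2}{N} - (\nu_1)^2 \\
\mu_3 &= \frac{\sum F(d/i)^3}{N} - 3\nu_1\frac{\sum F(d/i)^2}{N} + 2(\nu_1)^3 \\
\mu_4 &= \frac{\sum F(d/i)^4}{N} - 4\nu_1\frac{\sum F(d/i)^3}{N}
         + 6(\nu_1)^2\frac{\sum F(d/i)^2}{N} - 3(\nu_1)^4
\end{aligned}
\tag{5}
$$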


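From values given for $\nu_1$, $\nu_2$, $\nu_3$, and $\nu_4$ in Eqs. (1),
Eqs. (5) may now be written as follows:

$$
\begin{aligned}
\mu_1 &= \nu_1 - \nu_1 = 0 \\
\mu_2 &= \nu_2 - \nu_1^2 \\
\mu_3 &= \nu_3 - 3\nu_2\nu_1 + 2\nu_1^3 \\
\mu_4 &= \nu_4 - 4\nu_3\nu_1 + 6\nu_2\nu_1^2 - 3\nu_1^4
\end{aligned}
$$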


which it was said at the beginning of this section would be proved. 
An important corollary follows from the above derivation of the 
second moment (or “variance,” as it is sometimes called). 
Since

$$
\mu_2 = \nu_2 - \nu_1^2
$$

it follows that the mean square deviation about the mean of the
observations is less than the mean square deviation about any
other arbitrary quantity; that is, the mean square deviation
($\sigma^2$) about the mean is a minimum, smaller than it would be if
calculated from any other average. This is obvious from the
equation; since $\nu_1^2$ is a positive quantity, being a square,
$\mu_2$ must be less than $\nu_2$ whenever the arbitrary origin differs
from the mean.
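The corollary can be illustrated numerically. The sketch below is in Python and is not part of the original text. It assumes the class frequencies of the 300 freshmen heights (classes 62- through 77-, i = 1); the trial origins are arbitrary choices made only for illustration. For each origin it compares the mean square deviation about that origin with the second moment about the mean plus the square of the first moment about the origin.

```python
freqs = [1, 2, 1, 4, 12, 31, 31, 47, 48, 42, 35, 21, 14, 8, 2, 1]
mids = [62.5 + k for k in range(len(freqs))]   # class mid-values, i = 1
N = sum(freqs)
mean = sum(f * x for f, x in zip(freqs, mids)) / N


def mean_square_deviation(origin):
    """Mean square deviation of the grouped data about the given origin."""
    return sum(f * (x - origin) ** 2 for f, x in zip(freqs, mids)) / N


mu2 = mean_square_deviation(mean)              # second moment about the mean
trial_origins = (68.5, 69.5, 70.5, 71.5)       # hypothetical arbitrary origins

for A in trial_origins:
    nu1 = mean - A                             # first moment about A
    nu2 = mean_square_deviation(A)             # second moment about A
    # nu2 exceeds mu2 by exactly the square of nu1.
    print(A, round(nu2, 5), round(mu2 + nu1**2, 5))
```

Whatever origin is tried, the excess of the second moment about that origin over the second moment about the mean is the square of the distance between the origin and the mean, so it vanishes only at the mean itself.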

The Standard Deviation. The standard deviation about the
arithmetic mean may now be quickly calculated, since it is by
definition the square root of the second moment. For the
frequency distribution of heights of 300 freshmen the standard
deviation is as follows:

$$
\sigma = \sqrt{\mu_2} = \sqrt{6.18596} = 2.487 \text{ or } 2.49 \text{ in.}
$$

Since the moments were calculated in class-interval units (see
page 212), this result is also in class-interval units. The standard
deviation in original units is found by multiplying by i. In the
present problem, i = 1; hence the standard deviation in original
units is also 2.49 in.

The Beta Coefficients. For the frequency distribution of heights
of 300 freshmen, the first two beta coefficients are as follows:

$$
\beta_1 = \frac{\mu_3^2}{\mu_2^3} = 0.00052
\qquad
\beta_2 = \frac{\mu_4}{\mu_2^2} = 3.02102
$$

Since the betas are ratios having i raised to the same power
in both numerator and denominator, the fact that the moments
are in class-interval units instead of original units may be
disregarded.
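As a check on the arithmetic, the standard deviation and the two beta coefficients can be recomputed from the moments about the mean. The following Python sketch is an illustration only; it takes the moment values quoted in the text as given, and the variable names are not from the original.

```python
import math

# Moments about the mean, in class-interval units, as quoted in the text.
mu2, mu3, mu4 = 6.18596, -0.35171, 115.60268
i = 1.0                                   # class interval, in inches

sigma = i * math.sqrt(mu2)                # standard deviation in original units
beta1 = mu3**2 / mu2**3                   # first beta coefficient
beta2 = mu4 / mu2**2                      # second beta coefficient

print(round(sigma, 3), round(beta1, 5), round(beta2, 5))
```

Because both betas are pure ratios, changing `i` alters `sigma` but leaves `beta1` and `beta2` untouched.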

Measures of Skewness and Kurtosis. Measures of skewness
and kurtosis are also readily determined from the moments about
the mean. In the frequency distribution of heights of 300
freshmen, the measure of kurtosis, $\beta_2$, calculated above, is 3.021,
slightly larger than 3. Hence the frequency distribution is
somewhat less flat-topped than the normal curve.¹

Skewness in heights of the 300 freshmen, measured by the
cube root of the third moment, is −0.7057 class intervals.
Since i = 1 in., this is −0.7057 inch.

1 Figure 101, p. 295, is a graph comparing the frequency distribution
with the ideal normal curve.


CALCULATION OF OTHER STATISTICS


Averages and Measures of Variability. Difficulties in Locating
the Median and the Mode. Consideration of the median, the
mode, and the quartiles has been left to the last for the reason
that, in the analysis of frequency distributions with class
intervals, these values must be estimated. By definition, the median
is the value at the center of the distribution, the first quartile
is the value midway between the lower limit of the range and the
median, and the third quartile is the value midway between
the median and the upper limit of the range. The mode is the
value that occurs with the greatest frequency. The calculations
of these statistics are not based on the work sheet shown in
Table 15.

Because they are concealed in the class interval among a group 
of other cases in the same class interval, the quartiles, the median, 
and the mode must be obtained by estimation. Where within 
the range of the class interval is the median? Where within 
the range of the class interval with the largest frequency is the 
mode? These questions have to be answered by interpolation, 
and the value so obtained becomes an abstract quantity—as 
abstract and mathematical in character as the mean, but without 
the latter’s precision. 

The Mode. In the case of the mode, a further difficulty arises
in finding the correct answer to the question: Which class
interval should be considered to contain the mode? If
different-sized class intervals are taken in each of several frequency
distributions of the same data, the modal class interval will be
observed to shift around. The mode, by definition the simplest
of the several measures, is actually the most difficult average
to locate. Its accurate computation is more highly mathematical
than that of the arithmetic mean. If a Pearsonian curve gives a
good fit to the data, the ideal method of obtaining the mode is to
find the mode of this curve. A formula for this is given below.
The disadvantage of this method is that there is no
way of telling whether a curve is a good fit or not until it is
actually fitted, and this involves a considerable amount of
calculation just for the sake of finding the mode.

But simpler measures of the mode are often used. These
are interpolated values, on the assumption that the mode lies
in the modal class interval, that is, the class interval that has
the highest frequency. It is assumed that the general shape
of the distribution affects the distribution of cases at the point
of greatest concentration in the following manner: All the
frequencies below the modal class interval are pulling the mode toward
the lower limit of that class interval, and all the frequencies
above the modal class interval are pulling the mode toward the
upper limit of the interval. The mode is equal to the lower
limit of the modal class interval plus the interpolated part of
the class interval established by the relationship of the frequencies
above and below that class interval. In the frequency
distribution of freshmen heights (Table 15), the modal class interval,
that is, the class interval with the greatest concentration of cases,
is 70-. There are 129 cases pulling the mode toward the lower
limit of the class interval 70-, and 123 cases pulling the mode
toward the upper limit. Consequently,

$$
Mo = 70 + \frac{123}{129 + 123} \times 1 = 70.488 \text{ or } 70.49 \text{ in.}
$$
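The interpolation just described can be written as a short routine. The Python sketch below is illustrative and not part of the original text. The function name `interpolated_mode` and its arguments are assumptions, and it presumes a single modal class with at least one case on either side of it.

```python
def interpolated_mode(lower_limits, freqs, i):
    """Mode interpolated within the modal class from the cases above and below it."""
    k = freqs.index(max(freqs))            # position of the modal class interval
    below = sum(freqs[:k])                 # cases pulling toward the lower limit
    above = sum(freqs[k + 1:])             # cases pulling toward the upper limit
    return lower_limits[k] + above / (below + above) * i


# Heights of 300 freshmen: classes 62-, 63-, ..., 77- (class interval 1 in.).
freqs = [1, 2, 1, 4, 12, 31, 31, 47, 48, 42, 35, 21, 14, 8, 2, 1]
lower_limits = [62 + k for k in range(len(freqs))]

mode = interpolated_mode(lower_limits, freqs, 1.0)
print(round(mode, 3))
```

Regrouping the same data into wider or narrower classes and calling the function again is a quick way to see how much the interpolated mode shifts with the choice of class interval.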

The so-called “mathematical mode,” an approximation of the
mode of the Pearsonian curve that is invalid if the frequency
curve is very skewed, is calculated from the following equation:¹

$$
Mo = \bar{X} - 3(\bar{X} - Mi)
$$

For the frequency distribution of 300 freshmen heights, the
mathematical mode is calculated as follows:*

$$
Mo = 70.47333 - 3(70.47333 - 70.4375) = 70.366 \text{ or } 70.37 \text{ in.}
$$


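The mode of the Pearsonian curve fitted to the data is given
by the formula:

$$
Mo = \bar{X} - \sigma\, sk
\qquad\text{where}\qquad
sk = \frac{\sqrt{\beta_1}\,(\beta_2 + 3)}{2(5\beta_2 - 6\beta_1 - 9)}
$$

and sk is given the sign of the third moment. For the frequency
distribution of 300 freshmen heights, the mode calculated by this
equation is as follows:

$$
Mo = 70.50 \text{ in.}
$$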
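The two moment-based estimates of the mode can be compared side by side. The Python sketch below is an illustration, not part of the original text. It takes the mean, median, standard deviation, third moment, and beta coefficients quoted in this chapter as given, and it assumes that sk carries the sign of the third moment.

```python
import math

# Statistics of the 300 freshmen heights, as quoted in the text.
mean, median, sigma = 70.47333, 70.4375, 2.4872
mu3 = -0.35171
beta1, beta2 = 0.00052, 3.02102

# So-called "mathematical mode": mean less three times (mean - median).
mathematical_mode = mean - 3 * (mean - median)

# Mode of the fitted Pearsonian curve: mean less sigma times sk.
sk = math.sqrt(beta1) * (beta2 + 3) / (2 * (5 * beta2 - 6 * beta1 - 9))
sk = math.copysign(sk, mu3)               # sk takes the sign of the third moment
pearson_mode = mean - sigma * sk

print(round(mathematical_mode, 3), round(sk, 4), round(pearson_mode, 2))
```

The two estimates fall on opposite sides of the mean in this example, which is one reason the mode is difficult to pin down from a single sample.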


1 Cf. p. 173.
* The median is 70.4375. Cf. the next section.

The Median and the Quartiles. Determination of the median
and the quartiles by interpolation is reasonably accurate if, as
it is assumed, the cases are evenly distributed within the class
intervals containing the median and the two quartiles,
respectively. The calculation of the median and the quartiles is
facilitated by making a column of cumulated frequencies as
shown in Table 16. The median is equal to the lower limit of
the class containing the N/2th case plus an interpolated amount
within the class interval determined by the ratio of the balance of
frequencies necessary to make up N/2 frequencies to the
frequencies in the interval. In the frequency distribution of
freshmen heights (Table 16), N/2 = 150. The frequencies
are counted cumulatively from the lower limit of the first class
interval (top of the table). By this count, there are 129 cases
to the lower limit of the class interval 70-. When the point
70 is reached on the quantity scale, 129 cases have been counted;
but the median is the value of the 150th case, that is, 21 cases
beyond 70. From 70 to 71 there are 48 cases. Consequently,
the ratio of interpolation within the class interval is 21/48.
Accordingly, the estimate of the median in freshmen heights is as
follows:

$$
Mi = 70 + \frac{21}{48} \times 1 = 70.4375 \text{ or } 70.44 \text{ in.}
$$


TABLE 16. FREQUENCY DISTRIBUTION OF THE HEIGHTS OF 300 PRINCETON
FRESHMEN, CLASS OF 1943
(In inches. Class interval 1 in.)

   X       F     Cumulative F
  62-      1          1
  63-      2          3
  64-      1          4
  65-      4          8
  66-     12         20
  67-     31         51
  68-     31         82
  69-     47        129
  70-     48        177
  71-     42        219
  72-     35        254
  73-     21        275
  74-     14        289
  75-      8        297
  76-      2        299
  77-      1        300

         300

1. There are 129 cases to X = 70.

2. Since N/2 = 150.0, this leaves 150.0 − 129, or 21.0 cases to go, of the
48 cases in the next class interval (70-71).

3. The interpolated amount of the class-interval range is therefore
21/48 × 1.

The third and first quartiles are calculated by interpolating
in a similar manner for the values of the 3N/4th and the N/4th
cases. In the frequency distribution of the heights of 300
freshmen, following are the values of the quartiles:

$$
\begin{aligned}
Q_1 &= 68 + \frac{24}{31} \times 1 = 68.774 \text{ or } 68.77 \text{ in.} \\
Q_3 &= 72 + \frac{6}{35} \times 1 = 72.171 \text{ or } 72.17 \text{ in.}
\end{aligned}
$$
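The interpolation for the median and the quartiles follows one pattern, which can be sketched as a single routine. The Python code below is illustrative and not part of the original text. The function name `interpolated_quantile` is an assumption, and the routine relies on the same premise as the text, that cases are spread evenly within each class interval.

```python
def interpolated_quantile(lower_limits, freqs, i, p):
    """Value below which a proportion p of the cases lies, by interpolation."""
    target = p * sum(freqs)                # N/4, N/2 or 3N/4 cases, for example
    cumulative = 0
    for lower, f in zip(lower_limits, freqs):
        if f > 0 and cumulative + f >= target:
            # Balance of cases still needed, as a fraction of this class.
            return lower + (target - cumulative) / f * i
        cumulative += f
    raise ValueError("p must lie between 0 and 1")


# Heights of 300 freshmen: classes 62-, 63-, ..., 77- (class interval 1 in.).
freqs = [1, 2, 1, 4, 12, 31, 31, 47, 48, 42, 35, 21, 14, 8, 2, 1]
lower_limits = [62 + k for k in range(len(freqs))]

q1 = interpolated_quantile(lower_limits, freqs, 1.0, 0.25)
median = interpolated_quantile(lower_limits, freqs, 1.0, 0.50)
q3 = interpolated_quantile(lower_limits, freqs, 1.0, 0.75)
print(round(q1, 4), round(median, 4), round(q3, 4))
```

The running total kept in `cumulative` plays the part of the cumulative-frequency column of the table.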


The Average Deviation. The average, or mean, deviation 
is a measure of dispersion that has its minimum value when 
deviations are measured from the median. To compute the 
average deviation from the median, subtract each of the N values 
of X from the median, add the absolute values of the deviations, 
and divide by N. Thus, 


$$
A.D. = \frac{\sum |X - Mi|}{N} \tag{6}
$$
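For ungrouped data the computation is a one-line sum, and the minimum property can be seen by trying other centers. The Python sketch below is not part of the original text. The seven values are hypothetical figures invented for illustration, and `average_deviation` is an assumed name.

```python
from statistics import mean, median


def average_deviation(values, center):
    """Mean of the absolute deviations of the values from the given center."""
    return sum(abs(x - center) for x in values) / len(values)


values = [61, 64, 66, 67, 70, 71, 79]      # hypothetical heights, in inches
center_median = median(values)
center_mean = mean(values)

ad_about_median = average_deviation(values, center_median)
ad_about_mean = average_deviation(values, center_mean)
print(round(ad_about_median, 4), round(ad_about_mean, 4))
```

With these figures the average deviation about the median is the smaller of the two, in agreement with the minimum property stated above.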


The average deviation is simpler in concept than any other 
measure of dispersion. It is less affected by extreme deviations 
than the more popular standard deviation, and for this reason 
it probably has greater sampling reliability from extremely 
leptokurtic universes. In spite of these advantages the average 
deviation is not a popular measure of dispersion, partly because 
of several widely accepted but mistaken notions concerning its 
properties. 

It is often said that it is illogical to neglect the signs of devia- 
tions to be averaged and that this fallacy is avoided in the case 
of other measures of dispersion. It is true that the mean devia- 
tion from the median is the mean of absolute deviations from 
some average, but every other measure of dispersion is also equal 
(or proportional) to an average of absolute deviations from some 
average. The quartile deviation is the median of absolute 
deviations from the mid-quartile, and the standard deviation 
is the quadratic mean of absolute deviations from the mean. 

It has been said that the sampling reliability of the average 
deviation is less than that of the standard deviation. This may 
be true for normal universes, but it can hardly be true for all 
types. 

Grouped Data—Mid-value Assumption in Calculating Average 
Deviation. When data are grouped in the form of a frequency
distribution with equal class intervals, the average deviation can
be written in the simple form

$$
A.D. = \frac{i\sum |F(d/i)|}{N} \tag{7}
$$

where d is the deviation of class mid-values from the mid-value
of the class interval containing the median. Although Eq. (7)
is the exact value of the average deviation from the median
according to the assumption that all observations in every class
interval are equal to the mid-value of the interval (the same
assumption commonly used for the standard deviation), many
statisticians consider it unsatisfactory as a formula for the
average deviation. The chief reason for the dissatisfaction seems
to be that the mid-value assumption, which implies that the
median is the mid-value of the median interval, is inconsistent
with the ordinary notion of the interpolated median.

In applying the simple formula in practice, several corrections
may be used, some of which will be illustrated below. Each of
these corrections deals with a separate aspect of the problem of
approximating the average deviation of ungrouped data from a
frequency distribution. The two most important corrections
are usually of the same order of magnitude, but opposite in sign,
so that they tend to offset each other. For this reason, it is
usually advisable to use the simpler formula without correction,
because of its simplicity, unless the problem is of great importance
so that minor adjustments are worth making.

Grouped Data—Histogram Assumption in Calculating Average 
Deviation. The average deviation of the histogram considered
as a continuous frequency function is often used in preference 
to the simple formula for the average deviation presented in Eq. 
(7). This corresponds to the assumption on which the usual 
interpolated median is based. The median is the abscissa of 
the vertical line that divides the histogram into two equal 
areas. When the left half of the histogram is folded along this 
vertical line, over the half on the right, the average deviation 
is the first moment about the line of folding. 

To simplify the derivation, let d/i represent deviations from
the mid-value of the median interval, and let

$$
Mi = L + ci \tag{8}
$$

where L is the lower limit of the median interval, i is the width
of the class interval, and c is the proportion of observations in
the median interval that are less than Mi. It is to be noted
that the cases are assumed to be distributed uniformly through
the interval.

In these terms the formula for average deviation can be written
as follows:

$$
A.D. = \frac{i\sum |F(d/i)| + C_1 + C_2}{N}
     = \frac{i\left[\sum |F(d/i)| + c(1 - c)F_0\right]}{N} \tag{9}
$$

in which $F_0$ is the frequency of the median interval, $C_1$ is the
amount of correction associated with observations above and
below the median interval, and $C_2$ is the amount of correction
associated with the median interval itself.

[Fig. 74. Illustration of distribution of cases in and above and below
the median interval. The diagram marks the lower limit L, the median
Mi = L + ci, and the mid-value of the interval; $N/2 - cF_0$ observations
lie below the lower limit, $cF_0$ and $(1 - c)F_0$ lie within the interval
below and above Mi, and $N/2 - (1 - c)F_0$ observations lie above the
upper limit.]

To demonstrate the truth of this equation, consider the
diagram of the median interval shown in Fig. 74. Since
deviations from the mid-value of the median interval are
$(\tfrac{1}{2} - c)i$ too small for observations above the median interval
and $(\tfrac{1}{2} - c)i$ too large for those below the median interval, it
follows that

$$
\begin{aligned}
C_1 &= i\left(\tfrac{1}{2} - c\right)\left[\frac{N}{2} - (1 - c)F_0\right]
     - i\left(\tfrac{1}{2} - c\right)\left[\frac{N}{2} - cF_0\right] \\
    &= i\left(\tfrac{1}{2} - c\right)(2c - 1)F_0
     = iF_0\left(2c - 2c^2 - \tfrac{1}{2}\right)
\end{aligned}
\tag{10}
$$

The area in the median interval below Mi is $cF_0$, and its mean
deviation from Mi is $ci/2$. Similarly, $(1 - c)F_0$ lies above Mi
with a mean deviation of $(1 - c)i/2$. Hence,

$$
C_2 = cF_0\,\frac{ci}{2} + (1 - c)F_0\,\frac{(1 - c)i}{2}
    = iF_0\left(c^2 - c + \tfrac{1}{2}\right) \tag{11}
$$

From Eqs. (10) and (11), the combined corrections are found to be

$$
\begin{aligned}
C_1 + C_2 &= iF_0\left(2c - 2c^2 - \tfrac{1}{2}\right) + iF_0\left(c^2 - c + \tfrac{1}{2}\right) \\
          &= iF_0(c - c^2) = iF_0\,c(1 - c)
\end{aligned}
\tag{12}
$$

a result that verifies Eq. (9). Equation (9) is probably the most
convenient form available for computing the mean deviation
according to the histogram assumption.

Calculation of the average deviation by the use of Eq. (9) 
is illustrated by Table 17 and the ensuing analysis. 


TABLE 17. FREQUENCY DISTRIBUTION OF THE HEIGHTS OF 300 PRINCETON
FRESHMEN, CLASS OF 1943
(In inches. Class interval 1 in.)

   X       F     d/i    F(d/i)
  62-      1     -8       -8
  63-      2     -7      -14
  64-      1     -6       -6
  65-      4     -5      -20
  66-     12     -4      -48
  67-     31     -3      -93
  68-     31     -2      -62
  69-     47     -1      -47
  70-     48      0        0
  71-     42      1       42
  72-     35      2       70
  73-     21      3       63
  74-     14      4       56
  75-      8      5       40
  76-      2      6       12
  77-      1      7        7

         300            -298
                        +290

  Sum (without regard to sign) = 588


When the median and the quartiles were calculated, it was
assumed that the frequencies were evenly distributed in the class
intervals. This assumption is continued while calculating the
average deviation about the median. As shown in Table 17,
the sum of the deviations about the arbitrary origin without
regard to sign is 588. That is,

$$
\sum |F(d/i)| = 588
$$

where $d_1 = X_1 - A$, $d_2 = X_2 - A$, ..., $d_n = X_n - A$.

The sum desired is the sum without regard to sign of deviations
from the median. That is,

$$
\sum |F(x''/i)|
$$

where $x''_1 = X_1 - Mi$, $x''_2 = X_2 - Mi$, ..., $x''_n = X_n - Mi$.

NOTE: x has been used to symbolize the deviations from the arithmetic
mean; x'' is used to symbolize deviations from the median.

Accordingly, the above sum, 588, which for the illustration
chosen is $\sum |F(d/i)|$, can be adjusted by a calculated correction
that will change the sum to $\sum |F(x''/i)|$. This correction is
obtained by using Eq. (9).

From Table 17 and the analysis on pages 221 to 223 it is to be
noted that $F_0 = 48$, the frequency of the interval containing
the median; i = 1; and c = 0.44, since the median is 70.44, the
lower limit of the interval containing the median is 70, and c is
the proportion of observations in the median interval that are
less than the median. Accordingly, the average deviation may
be calculated by using Eq. (9), as follows:

$$
\begin{aligned}
A.D._{Mi} &= \frac{i\left[\sum |F(d/i)| + c(1 - c)F_0\right]}{N} \\
          &= \frac{1[588 + 0.44(0.56)48]}{300} \\
          &= \frac{588 + 11.83}{300} \\
          &= 2.00 \text{ in.}
\end{aligned}
$$
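The corrected formula can be verified against a direct calculation on the histogram. The Python sketch below is illustrative and not part of the original text. It assumes the class frequencies of the 300 freshmen heights and uses the unrounded proportion c = 21/48 rather than 0.44. The helper `mean_abs_dev_in_class` is a hypothetical name for the mean absolute deviation from the median of cases spread uniformly over one class.

```python
freqs = [1, 2, 1, 4, 12, 31, 31, 47, 48, 42, 35, 21, 14, 8, 2, 1]
lowers = [62.0 + k for k in range(len(freqs))]   # lower class limits
i = 1.0
N = sum(freqs)

# Locate the median interval and the proportion c of its cases below Mi.
cumulative = 0
for k, f in enumerate(freqs):
    if cumulative + f >= N / 2:
        break
    cumulative += f
F0 = freqs[k]
c = (N / 2 - cumulative) / F0
median = lowers[k] + c * i

# Sum of |F(d/i)|, with d measured from the mid-value of the median interval.
sum_abs = sum(f * abs(j - k) for j, f in enumerate(freqs))

# Mid-value sum corrected by c(1 - c)F0.
ad_corrected = i * (sum_abs + c * (1 - c) * F0) / N


def mean_abs_dev_in_class(lower):
    """Mean |x - median| for cases spread uniformly over [lower, lower + i]."""
    upper = lower + i
    if upper <= median:
        return median - (lower + upper) / 2
    if lower >= median:
        return (lower + upper) / 2 - median
    return ((median - lower) ** 2 + (upper - median) ** 2) / (2 * i)


ad_direct = sum(f * mean_abs_dev_in_class(lo) for f, lo in zip(freqs, lowers)) / N
print(sum_abs, round(ad_corrected, 4), round(ad_direct, 4))
```

The corrected formula and the class-by-class calculation agree, which is the content of the verification carried out algebraically in the text.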


The Semiquartile Range. The semiquartile range, one-half
of the difference between the third quartile and the first quartile,
is another statistic that measures variability. Its formula is

$$
Q = \frac{Q_3 - Q_1}{2}
$$

For the frequency distribution in Table 15, the semiquartile
range is calculated as follows:

$$
Q = \frac{72.1714 - 68.7742}{2} = 1.6986 \text{ or } 1.70 \text{ in.}
$$


Measures of Skewness. From statistics measuring variation
and central tendencies, important measures of skewness are
obtained. It has been noted that $\bar{X} - Mo$ is a measure of
skewness. In the frequency distribution of 300 Princeton
freshmen heights,

$$
\bar{X} - Mo = 70.473333 - 70.36584 = 0.10749 \text{ or } 0.11 \text{ in.}
$$

The position of the first and third quartiles in relation to the
median is a very convenient statistic measuring skewness,
namely,

$$
(Q_3 - Mi) - (Mi - Q_1) \qquad\text{or}\qquad Q_1 + Q_3 - 2Mi
$$

For the frequency distribution of heights of 300 Princeton
freshmen this statistic is

$$
68.7742 + 72.1714 - 2(70.4375) = 0.07 \text{ in.}
$$


COEFFICIENTS OF VARIABILITY 


The various aggregative measures of variability may con- 
veniently be expressed as relatives or coefficients, as explained 
in the preceding chapter; indeed, they must be so expressed if 
comparisons are to be made with other frequency distributions 
having different types of units. The aggregative measures of 
variability are converted into relatives or coefficients by dividing 
the former by the mean, the median, or the average of the two 
quartiles. For the present problem, the relative measures of
variability that would be useful in comparing this frequency
distribution with other frequency distributions are as follows:

$$
\begin{aligned}
V_{\sigma} &= \frac{\sigma}{\bar{X}} = \frac{2.487}{70.47} = 3.53 \text{ per cent} \\
V_{A.D.} &= \frac{A.D.}{Mi} = \frac{2.00}{70.44} = 2.84 \text{ per cent} \\
V_Q &= \frac{Q_3 - Q_1}{Q_3 + Q_1} = \frac{3.3972}{140.9456} = 2.41 \text{ per cent}
\end{aligned}
$$

The formula for $V_Q$ is really the semiquartile range divided
by the average of the two quartiles, but the 2's cancel out, leaving
merely the difference between the two quartiles in the numerator
and their sum in the denominator.
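Each coefficient is simply a measure of variability divided by the matching average. The Python sketch below is an illustration and not part of the original text. It takes the mean, median, standard deviation, average deviation, and quartiles found earlier in the chapter as given, and the variable names are assumptions.

```python
# Statistics of the 300 freshmen heights, in inches, as found earlier.
mean, median = 70.47333, 70.4375
sigma, avg_dev = 2.4872, 2.00
q1, q3 = 68.7742, 72.1714

semiquartile_range = (q3 - q1) / 2

v_sigma = sigma / mean                     # standard deviation relative to the mean
v_ad = avg_dev / median                    # average deviation relative to the median
v_q = (q3 - q1) / (q3 + q1)                # quartile coefficient

# The quartile coefficient equals the semiquartile range over the mid-quartile.
v_q_long_form = semiquartile_range / ((q1 + q3) / 2)

for name, v in (("V_sigma", v_sigma), ("V_AD", v_ad), ("V_Q", v_q)):
    print(name, round(100 * v, 2), "per cent")
```

Expressing each measure as a percentage of its own average is what makes the figures comparable across distributions recorded in different units.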


COEFFICIENTS OF SKEWNESS 


Statistics measuring skewness are likewise more significant 
for comparative purposes when expressed as coefficients. The 
various coefficients of skewness for the frequency distribution 
in Table 16 are as follows: 

Based on mathematical statistics:

$$
\frac{\sqrt[3]{\mu_3}}{\sigma} = \frac{-0.7057 \text{ in.}}{2.487 \text{ in.}}
  = -0.284, \text{ or } -28.4 \text{ per cent}
$$

$$
sk = \frac{\sqrt{\beta_1}\,(\beta_2 + 3)}{2(5\beta_2 - 6\beta_1 - 9)}
   = -0.0112, \text{ or } -1.12 \text{ per cent}
$$

NOTE: This is given the negative sign because the third moment is
negative.

Based on other statistics (using Mo = 70.488):

$$
\frac{\bar{X} - Mo}{\sigma} = \frac{-0.015 \text{ in.}}{2.487 \text{ in.}}
  = -0.006, \text{ or } -0.6 \text{ per cent}
$$

(If the so-called “mathematical mode,” i.e., Mo = 70.366, is used, this
coefficient of skewness by the same formula would be +4.32 per cent.)

Using the median and the two quartiles to measure skewness,
the following result is obtained:

$$
sk = \frac{Q_1 + Q_3 - 2Mi}{Q} = \frac{+0.0706 \text{ in.}}{1.6986 \text{ in.}}
   = +0.0416, \text{ or } +4.16 \text{ per cent}
$$
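The several coefficients of skewness can be placed side by side in a short computation. The Python sketch below is illustrative and not part of the original text. It takes the statistics found earlier in the chapter as given, and it assumes that the first coefficient is the cube root of the third moment divided by the standard deviation.

```python
import math

# Statistics of the 300 freshmen heights, as found earlier in the chapter.
mean, median, sigma = 70.47333, 70.4375, 2.4872
mu3 = -0.35171
q1, q3 = 68.7742, 72.1714
interpolated_mode = 70.488
mathematical_mode = 70.36584

# Cube root of the third moment (keeping its sign), relative to sigma.
cube_root_skew = math.copysign(abs(mu3) ** (1 / 3), mu3) / sigma

# Mean less mode, relative to sigma, for two different estimates of the mode.
skew_interpolated_mode = (mean - interpolated_mode) / sigma
skew_mathematical_mode = (mean - mathematical_mode) / sigma

# Quartile measure of skewness, relative to the semiquartile range.
quartile_skew = (q1 + q3 - 2 * median) / ((q3 - q1) / 2)

print(round(cube_root_skew, 3), round(skew_interpolated_mode, 3),
      round(skew_mathematical_mode, 4), round(quartile_skew, 4))
```

The coefficients differ in sign as well as in size, so a figure for skewness means little unless the formula behind it is stated.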

The difficulty of locating the mode, even when quite a large
sample is taken, is illustrated by the frequency distribution
analyzed in this chapter. In this illustration every indication
from the moments is that the mode is larger than the mean, and the
interpolated mode (70.49 in.) agrees; but the so-called
“mathematical mode” calculated from the median (70.37 in.) is
smaller than the mean.


GRAPHIC INTERPRETATION OF STATISTICS OF VARIABILITY 
AND SKEWNESS 


Figure 75 shows on a scale the relative location of the median,
the two quartiles, and the upper and lower limits found by taking
$Mi \pm A.D._{Mi}$, namely, 70.44 ± 2.00. The figure illustrates the
fact that, when there is skewness, the location of the quartiles
with reference to the median reflects the presence of skewness.
If, therefore, the quartiles are used as measures of deviation,
they reflect the fact that, in skewed distributions, the deviation
is skewed in one direction or the other. If the average deviation
is used as a measure of deviation or variability, the presence of
skewness will not be noted in the results. Whenever the
distribution is skewed to any extent, the quartiles are unequal
distances from the median, as may be noticed in Fig. 75. As the
figure also illustrates, the average deviation is conceived as an
equal distance on each side of the median.

[Fig. 75. Illustration of significance of average deviation and two
quartiles as measures of dispersion. Points marked on the scale:
Mi − A.D. = 68.44, Q1 = 68.77, Mi = 70.44, Q3 = 72.17, Mi + A.D. = 72.44.]

Figure 76 shows on a scale the relative location of the median
and average deviation and the location of the mean and the
standard deviation by depicting the upper and lower limits of
$Mi \pm A.D._{Mi}$ (as in Fig. 75) and, in addition, the upper and
lower limits of $\bar{X} \pm \sigma$. As in the case of the average deviation,
so also in the case of the standard deviation, the measure of
variability is conceived as an equal distance above and below
the mean, that is, an equal distance from the mean on the
x-axis in both the positive and the negative directions in Figs.
75 and 76. If the distribution is skewed to a marked extent, it
should be evident that care must be exercised in interpreting the
significance of the average deviation or the standard deviation.
From Fig. 75 it is noted that the first quartile in the negative
direction and the third quartile in the positive direction are less
distant from the median than $\pm A.D._{Mi}$. By definition, the
limits of the range between the first and third quartiles include
exactly 50 per cent of the cases. For a normal distribution¹ the
range between the upper and lower limits defined by
$Mi \pm A.D._{Mi}$ includes approximately 58 per cent of the cases.

[Fig. 76. Illustration of the standard deviation and average deviation
as measures of variability. Points marked on the scale:
Mi − A.D. = 68.44, Mi = 70.44, Mi + A.D. = 72.44; and
mean − σ = 67.98, mean = 70.47, mean + σ = 72.96.]

It will be noted from Fig. 76 that the limits $\bar{X} \pm \sigma$ are farther,
respectively, in the positive and negative directions from the
mean than are the limits $Mi \pm A.D._{Mi}$ from the median. The
standard deviation is always larger than the average deviation;
in fact, an approximate check² on the accuracy of calculation may
be used as follows: $A.D. \approx 0.8\sigma$. In the frequency distribution
illustrated, this check works fairly well; for 0.8(2.49) = 1.99 and
the calculated $A.D._{Mi}$ = 2.00. For a normal distribution the
range between the upper and lower limits defined by $\bar{X} \pm \sigma$
includes approximately two-thirds (68 per cent) of the cases.³
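The percentages quoted for the normal curve, and the factor 0.8 in the approximate check, can be reproduced from the error function. The Python sketch below is illustrative and not part of the original text. It rests on the standard result that, for a normal curve, the average deviation equals the standard deviation multiplied by the square root of 2/π, and `normal_coverage` is an assumed name.

```python
import math


def normal_coverage(z):
    """Proportion of a normal distribution within z standard deviations of the mean."""
    return math.erf(z / math.sqrt(2))


# Ratio of average deviation to standard deviation for a normal curve.
ad_over_sigma = math.sqrt(2 / math.pi)

within_one_sigma = normal_coverage(1.0)
within_one_ad = normal_coverage(ad_over_sigma)
within_quartiles = normal_coverage(0.6745)   # quartiles lie about 0.6745 sigma out

# Approximate check applied to the freshmen heights (sigma = 2.49 in.).
approx_ad = 0.8 * 2.49

print(round(ad_over_sigma, 4), round(within_one_ad, 3),
      round(within_one_sigma, 3), round(approx_ad, 2))
```

The closer a distribution is to the normal form, the more nearly the computed average deviation should match 0.8 of the standard deviation.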


1 See Chap. XI for description of a normal distribution.

2 For more precise discussion and explanation, see Chaps. XI and XII.

3 For distributions that depart widely from the normal form, this check
may not be satisfactory.


FREQUENCY DISTRIBUTIONS WITH UNEQUAL CLASS INTERVALS

As remarked earlier in this chapter, the size of the class interval
should ordinarily be uniform throughout a given frequency
distribution; but in some cases, usually because there is a large
concentration of cases at one or the other extreme of the range,
it is considered necessary to use different-sized class intervals for
parts of the frequency distribution in order to give a proper
picture of variability. Table 18 illustrates such an instance.
Of 150 cases distributed over the range 0-51, 106 cases fell within
the limits 0-10. Obviously, a small number of class intervals
of uniform size would give a wholly erroneous notion of the
variation. Occasionally, data at their primary source will be
published in a manner similar to that of Table 18, and the
statistician has no choice but to utilize the material in frequency
distributions that have unequal-sized class intervals. This is
particularly true of statistics of wages and income and statistics
of hours of labor.


TABLE 18. DEATHS DUE TO AUTOMOBILE ACCIDENTS IN 150 CITIES,*
FIRST 20 WEEKS OF 1940

  Number of deaths due to      Number of cities whose     Calculations of
  automobile accidents, X      automobile accident        deviations from an
                               fatalities were as         arbitrary origin
                               specified,                 A = 15,
  Intervals    Mid-values      F                          d/i†

    0-            0.5             11                        -1.45
    1-            2.0             23                        -1.30
    3-            4.0             34                        -1.10
    5-            7.5             38                        -0.75
   10-           15.0             24                         0
   20-           25.0             12                         1.00
   30-           35.0              4                         2.00
   40-51         45.5              4                         3.05

                                 150

* New York, Los Angeles, Chicago, and Detroit are excluded from these statistics.
United States Bureau of the United States Census, Weekly Accident Bulletin, May 24, 1940,
pp. 1-4.

† i = 10.

If the mid-value of the class interval 10- is taken as the
arbitrary origin, that is, A = 15, and the “class interval” or
abscissa scale unit i is taken equal to 10 (since that size interval
predominates), the deviations of class intervals in that part of the
frequency distribution where class intervals are equal are readily
determined. Where the class intervals are unequal, simple
subtraction of mid-values and the division of the answer by the
scale unit gives the results in the last column of Table 18. To
illustrate the process, there is a difference of 10.5 between
mid-value 35 and mid-value 45.5, a quantity 1.05 times the scale
unit. Accordingly, the deviation advances from 2.0 to 3.05
scale units. In the lower reaches of the range there is a difference
of 7.5 between mid-value 15 and mid-value 7.5, or 3/4 of a scale unit;
consequently, the step-deviation change is from 0 to −0.75.
From mid-value 7.5 to mid-value 4, the deviation recedes 0.35
of a scale unit to −1.10. From mid-value 4.0 to mid-value 2.0, a
distance of 1/5 of a scale unit, the scale-unit deviation changes from
−1.10 to −1.30. Finally, from mid-value 2.0 to mid-value 0.5,
a distance of 3/20 of a scale unit, the scale-unit deviation recedes
from −1.30 to −1.45.
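The step deviations for unequal class intervals come from the same subtraction and division throughout, which a few lines of code make plain. The Python sketch below is illustrative and not part of the original text. It assumes the mid-values and frequencies of the automobile-accident table, with A = 15 and i = 10, and carries the calculation one step further to the arithmetic mean as a check.

```python
# Mid-values and frequencies of the automobile-accident distribution.
mids = [0.5, 2.0, 4.0, 7.5, 15.0, 25.0, 35.0, 45.5]
freqs = [11, 23, 34, 38, 24, 12, 4, 4]
A, i = 15.0, 10.0                          # arbitrary origin and scale unit

# Deviation of each mid-value from A, expressed in scale units.
step_deviations = [(x - A) / i for x in mids]

# The first moment about A then gives the mean in the usual way.
N = sum(freqs)
nu1 = sum(f * d for f, d in zip(freqs, step_deviations)) / N
mean_from_steps = A + i * nu1

# Direct calculation from the mid-values, for comparison.
mean_direct = sum(f * x for f, x in zip(freqs, mids)) / N

print([round(d, 2) for d in step_deviations])
print(round(mean_from_steps, 3), round(mean_direct, 3))
```

Once the deviations are in scale units, the unequal widths of the classes play no further part in the computation.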

From this point on, the analysis of the frequency distribution 
is the same as it would be were uniform class intervals used, 
although obviously the uneven numbers add somewhat to the 
burden of filling in the work sheet according to the plan shown in 
Table 16. Once the work sheet has been completed, however, 
the fact that the class intervals are not uniform ceases to be a 
consideration in the subsequent computations; the summation 
figures can be applied in the formulas in precisely the same 
manner as if the class intervals were uniform. 


ACCURACY IN THE CALCULATION OF STATISTICS 


Ordinary common sense would dictate that all recording of
figures needs to be carefully checked, since there is always a
chance of making a mistake in copying. Such mistakes are not
statistical errors to be disregarded under the “theory of errors,”
which is explained in Chap. XI. They cannot be disregarded,
and every effort should be made to prevent their occurrence.
The same applies to all calculations made, but frequently
short-cut or cross checks can be devised for these. While
accuracy is essential, a spurious accuracy may be introduced into
final answers. For example, in most cases final figures
representing samples should be presented in round numbers, including
only the significant figures in the arithmetical answers obtained.¹

Care must be taken, however, in cases where errors are likely to
accumulate through successive steps of calculation. It may be
necessary to retain the figures in a calculated result for a number
of places beyond the significant figures if that calculated result
is being used in the process of calculating other statistics. In
some statistical problems it is necessary to add a constant
successively perhaps fifty or even hundreds of times, or, similarly,
to multiply by a constant successively a large number of times.
In such instances the constant should be written to several more
places than will be used in the final answers in order to avoid
an error in significant figures at the end of the process. This is a
purely mathematical problem; in every case, the standard of
accuracy required, or the number of significant figures, having
been decided upon, a simple arithmetical calculation will show to
how many places the intermediary calculations must be carried.
The final results are then rounded off to the number of significant
figures.

1 The meaning of “significant” is explained on p. 213, note.

In rounding numbers the rule is that a remainder less than
half a unit is disregarded, while half or more than half is counted
as an additional unit. Alternatively, a remainder of exactly half
may be rounded to the nearest even number; thus 174.5 would be
174 but 175.5 would be 176.
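The two treatments of an exact half can be compared directly. The Python sketch below is illustrative and not part of the original text. It uses the rounding modes of the standard `decimal` module, and the helper name `round_to_unit` is an assumption.

```python
from decimal import Decimal, ROUND_HALF_EVEN, ROUND_HALF_UP


def round_to_unit(value, rule):
    """Round a decimal string to the nearest whole unit under the given rule."""
    return int(Decimal(value).quantize(Decimal("1"), rounding=rule))


for figure in ("174.5", "175.5", "174.4", "174.6"):
    half_up = round_to_unit(figure, ROUND_HALF_UP)       # half counts as a unit
    half_even = round_to_unit(figure, ROUND_HALF_EVEN)   # half goes to the even number
    print(figure, half_up, half_even)
```

The two rules differ only when the remainder is exactly half. Rounding halves to the even number avoids the steady upward drift that arises when many such figures are summed.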


PART III 


The Normal Frequency Curve 


CHAPTER VIII 


PROBABILITY 


Up to this point, the discussion has primarily been concerned 
with “descriptive statistics.” Attention has centered upon 
methods of summarizing and describing statistical variation. 
Occasionally, theory has been employed to explain certain 
methods or to indicate why one method is to be preferred to 
another; but, in general, emphasis has been upon the facts as 
such, rather than upon any theoretical explanation of or inference 
to be made from these facts. 

In contrast, the next four chapters will be primarily concerned 
with a particular body of theoretical statistics, namely, the theory 
of the normal frequency curve. The question now to be con- 
sidered is not “what” is the character of a given frequency dis- 
tribution, but “why.” The discussion will be abstract and 
general and will not pertain to actual concrete data, except by 
way of illustration. 

Before this theoretical analysis can be undertaken, however, 
certain mathematical tools must be acquired and certain funda- 
mental concepts clarified. That is the purpose of this and the 
next chapters. 


PERMUTATIONS AND COMBINATIONS 


Permutations Defined and Illustrated. A “permutation” is 
an arrangement. The word “man,” for example, is a special 
arrangement of the three letters m, a, and n. Other possible 
arrangements of these three letters are: mna, nma, nam, anm, 
and amn. All these arrangements are permutations. 

In general, if there are N different things, it is possible to form 
N!¹ different permutations. Consider again the three letters 
m, a, and n. In making various arrangements of these, it is 
possible to pick the first letter in three different ways. The 
first letter having been picked, there are then left two different 
ways for the selection of the second letter. Finally, the first 
two letters having been selected, there remains one, and only one, 
way for the selection of the last letter. Now, each one of the two 
ways that are open for the selection of the second letter can be 
combined with each of the three ways that are open for the 
selection of the first letter, so that there are 3 × 2 different ways 
of picking the first and second letters. Since there is only one 
way left in every case for the selection of the last letter, there are 
therefore 3 × 2 × 1 = 6 different ways of picking all the three 
letters. Thus, the number of different permutations of three 
things is 3! = 6. If there had been 10 different letters, the 
number of different permutations of these would have been 
10 × 9 × 8 × 7 × 6 × 5 × 4 × 3 × 2 × 1 = 10! = 3,628,800. 

Suppose, now, that among 10 different things 3 are to be 
selected for some particular purpose, the exact nature of the 
purpose being immaterial for the analysis. The question is: 
In how many different ways may a subgroup of 3 be selected 
from the total of 10; in other words, what is the number of 
different permutations that can be made of 10 things taken 3 at a 
time? This question may be answered as follows: It is possible 
to select the first of the subgroup of 3 in 10 different ways, the 
second in 9 different ways, and the third in 8 different ways. 
There are thus altogether 10 × 9 × 8 different ways in which 
the 3 things may be selected from the total of 10. Accordingly, 
the number of different permutations of 10 things taken 3 at a 
time is 10 × 9 × 8 = 720. In general, the number of different 
permutations of N things taken r at a time is 

$$P_r^N = N(N - 1) \cdots \text{ to } r \text{ factors}$$

that is, 

$$P_r^N = N(N - 1)(N - 2) \cdots (N - r + 1) \qquad (1)$$
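
The counting argument for permutations can be checked by direct enumeration. The sketch below assumes Python 3.8 or later (for `math.perm`); the function name `permutations_count` is hypothetical and simply spells out Eq. (1) as a product of r descending factors.

```python
from itertools import permutations
from math import factorial, perm

# All arrangements of the three letters m, a, n.
arrangements = ["".join(p) for p in permutations("man")]
print(arrangements)
print(len(arrangements) == factorial(3))  # True: 3! = 6 arrangements

def permutations_count(n: int, r: int) -> int:
    """Eq. (1): N(N - 1)(N - 2)...(N - r + 1), a product of r factors."""
    count = 1
    for factor in range(n, n - r, -1):
        count *= factor
    return count

print(permutations_count(10, 3))   # 720, i.e. 10 x 9 x 8
print(permutations_count(10, 10))  # 3628800, i.e. 10!
print(perm(10, 3))                 # the standard-library equivalent
```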

1 N! is to be read “N factorial” and signifies the successive product of N 
by all the integers less than N and greater than zero. 

Combinations Defined and Illustrated. A “combination” is 
not the same thing as a permutation. A group of 3 letters 
constitutes a combination of these 3 letters; but as has just been 
seen, this combination can be arranged in 3! different ways. In 
other words, it is possible to have 3! permutations of a single 
combination of 3 things. In general, it is possible to have N! 
permutations of a single combination of N things. 

Although a group of N things forms but a single combination, 
subgroups may be picked in such a way as to constitute different 
combinations. Suppose, for example, that the board of directors 
of a given corporation consists of 10 men and the chairman. 
The chairman wishes to pick a committee of 3 men. In how 
many different ways can such a committee be constituted, the 
chairman himself being excluded? This is a question of how 
many different combinations of 3 men may be taken from a 
group of 10 men. It will be noted that the order of selection 
is immaterial, for it is only the constituency of any committee 
that differentiates it from other possible committees. 

The answer to this question is obtained as follows: Let $C_3^{10}$ 
represent the number of combinations to be calculated, viz., the 
number of different combinations of 10 things taken 3 at a time. 
Each one of these combinations, it will be recalled, can be 
arranged in 3! different ways; i.e., there are 3! different ways in 
which a given committee can be selected. Accordingly, the 
total number of ways in which a committee of 3, i.e., just any 
committee and not a particular committee, can be chosen is 
equal to $C_3^{10} \times 3!$. But the total number of different ways in 
which a committee of 3 can be picked from a group of 10 is the 
number of permutations of 10 things taken 3 at a time, which is 
equal to 10 × 9 × 8. Therefore, $C_3^{10} \times 3! = 10 \times 9 \times 8$, and 

$$C_3^{10} = \frac{10 \times 9 \times 8}{3!} = 120$$

In general, the number of different combinations of N things 
taken r at a time is 

$$C_r^N = \frac{N(N - 1)(N - 2) \cdots (N - r + 1)}{r!} \qquad (2)$$

or, if numerator and denominator are both multiplied by (N − r)!, 

$$C_r^N = \frac{N!}{r!\,(N - r)!} \qquad (3)$$
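
The committee problem and Eqs. (2) and (3) can be verified by listing the committees outright. This is a small sketch, assuming Python 3.8 or later (for `math.comb` and `math.perm`); the names `directors` and `combinations_count` are illustrative and do not come from the text.

```python
from itertools import combinations
from math import comb, factorial, perm

directors = range(10)  # the 10 men, chairman excluded
committees = list(combinations(directors, 3))
print(len(committees))  # 120 distinct committees of 3

def combinations_count(n: int, r: int) -> int:
    """Eq. (3): N! / (r! (N - r)!)."""
    return factorial(n) // (factorial(r) * factorial(n - r))

# Each committee can be selected in 3! orders, so C x 3! equals
# the number of permutations of 10 things taken 3 at a time.
print(combinations_count(10, 3) * factorial(3) == perm(10, 3))  # True
print(combinations_count(10, 3) == comb(10, 3))                 # True
```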


The Binomial Expansion. A use that is made of combinatorial 
theory in elementary algebra is to find a formula for the 
expansion of the binomial $(x + y)^N$. It will be recalled that 
$(x + y)^2 = (x + y)(x + y)$ is found by multiplying each term 
of the first factor by each term of the second factor and adding 
these partial products. Thus 

$$(x + y)^2 = x^2 + xy + xy + y^2 = x^2 + 2xy + y^2$$

A higher powered binomial can be evaluated by mere repetition 
of this process. Thus 

$$(x + y)^3 = (x + y)(x + y)(x + y) = (x^2 + 2xy + y^2)(x + y)$$
$$= x^3 + 2x^2y + xy^2 + x^2y + 2xy^2 + y^3 = x^3 + 3x^2y + 3xy^2 + y^3$$


It will be noted that the result in each case consists of a series of 
terms in diminishing powers of x (or rising powers of y), and this 
is generally true no matter what the power of the binomial. 
It will also be noted that the number of times a given term 
occurs (i.e., its coefficient in the expansion) depends on the 
number of ways the x’s (or y’s) that make up that term can be 
selected from the different factors. Thus in the case of $(x + y)^3$ 
the term composed of three x’s, that is, $x^3$, can be formed in only 
one way, namely, by taking an x from each of the three factors. 
The term $x^2y$, however, which contains two x’s, can be formed 
in three ways. This is because the number of different 
combinations that can be made of three x’s taken two at a time is 
$\frac{3!}{2!\,1!} = 3$. Similarly, the coefficient of $xy^2$ is the number of 
different combinations of three x’s taken one at a time, which is 
$\frac{3!}{1!\,2!} = 3$. Accordingly, the expansion of $(x + y)^3$ might 
be written $(x + y)^3 = C_3^3 x^3 + C_2^3 x^2y + C_1^3 xy^2 + C_0^3 y^3$, where $C_3^3$ 
means the number of combinations of three things taken three 
at a time, $C_2^3$ equals the number of combinations of three things 
taken two at a time, etc.,¹ the evaluation of these quantities to be 
determined by Eq. (3). If consideration were given to the 
power of y instead of x, this new method of writing the expansion 
of $(x + y)^3$ would become 

$$(x + y)^3 = C_0^3 x^3 + C_1^3 x^2y + C_2^3 xy^2 + C_3^3 y^3$$

1 Note that 0! is taken by convention to be 1, so that $C_3^3 = 1$. 

In general, 

$$(x + y)^N = C_N^N x^N + C_{N-1}^N x^{N-1}y + \cdots + C_r^N x^r y^{N-r} + \cdots + C_1^N xy^{N-1} + C_0^N y^N \qquad (4)$$

or, on using the second method of expression, 

$$(x + y)^N = C_0^N x^N + C_1^N x^{N-1}y + \cdots + C_r^N x^{N-r} y^r + \cdots + C_{N-1}^N xy^{N-1} + C_N^N y^N \qquad (4a)$$

Thus 

$$(x + y)^4 = C_4^4 x^4 + C_3^4 x^3y + C_2^4 x^2y^2 + C_1^4 xy^3 + C_0^4 y^4$$
$$= x^4 + 4x^3y + 6x^2y^2 + 4xy^3 + y^4$$

and 

$$(x + y)^5 = C_5^5 x^5 + C_4^5 x^4y + C_3^5 x^3y^2 + C_2^5 x^2y^3 + C_1^5 xy^4 + C_0^5 y^5$$
$$= x^5 + 5x^4y + 10x^3y^2 + 10x^2y^3 + 5xy^4 + y^5$$


It is in this way that the combinatorial formulas enter into the 
binomial expansion. Later it will be seen that a certain fre- 
quency distribution is called a “binomial distribution” because 
its relative frequencies are computed in the same way as the 
coefficients of the terms of a binomial expansion. 
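
The way the combination formula generates the coefficients of a binomial expansion can be illustrated numerically. The sketch below assumes Python 3.8 or later (for `math.comb`); the two function names are hypothetical. It lists the coefficients in the order of Eq. (4) and checks the expansion against a direct evaluation of (x + y) raised to the Nth power.

```python
from math import comb

def binomial_coefficients(n: int) -> list:
    """Coefficients of (x + y)^n in the order of Eq. (4): C_n, C_(n-1), ..., C_0."""
    return [comb(n, r) for r in range(n, -1, -1)]

def expand_numerically(x: int, y: int, n: int) -> int:
    """Sum the terms C_r x^r y^(n - r) of Eq. (4) for r = 0, 1, ..., n."""
    return sum(comb(n, r) * x**r * y**(n - r) for r in range(n + 1))

print(binomial_coefficients(3))  # [1, 3, 3, 1]
print(binomial_coefficients(4))  # [1, 4, 6, 4, 1]
print(binomial_coefficients(5))  # [1, 5, 10, 10, 5, 1]

# The term-by-term sum agrees with the direct power, e.g. (2 + 3)^5 = 3125.
print(expand_numerically(2, 3, 5) == (2 + 3) ** 5)  # True
```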


MATHEMATICAL PROBABILITY 


The concept of probability has been the subject of much 
debate among philosophers, mathematicians, and statisticians. 
To enter into this debate, however, would be beyond the scope 
of this book.¹ Although the concept of probability presented 
below appears to be the most suitable for an elementary text 
and is apparently the one most in favor among statisticians, it 
must not be thought that other approaches are necessarily 
invalid or even possibly less fruitful.² 


1 A brief review of the classical theory and the frequency theory of R. von 
Mises is presented in the Appendix, pp. 242-251. 

2 The concept of probability presented in this book is patterned after that 
presented by J. Neyman in his Lectures and Conferences on Mathematical 
Statistics (Graduate School of the United States Department of Agriculture, 
Washington, 1937). His views, Dr. Neyman believes, “are shared by E. S. 
Pearson and other workers attached to the Department of Statistics at 
University College, London.” He also refers to H. Cramér, Random 
Variables and Probability Distributions (Cambridge, 1937); Maurice Fréchet, 
Recherches théoriques modernes sur la théorie des probabilités (Gauthier-Villars, 
Paris, 1937); A. Kolmogoroff, Grundbegriffe der Wahrscheinlichkeitsrechnung 
(Julius Springer, Berlin, 1933); and D. J. Struik, “On the 

Definition. A discussion of probability can best begin with a 
finite set of objects. Suppose that, in a given set of t objects, m 
possess a given property and n do not possess this property. 
Then the probability of an object of this set having the given 
property is m/t or the relative frequency of these objects in the 
set. The word “object” as used in this definition is to be 
interpreted broadly. Besides objects proper, it may be taken 
to include events that have the property of occurring or even 
propositions that have the property of being true. 

To illustrate the above definition of probability, consider an 
ordinary deck of 52 playing cards. This will have 26 red cards 
and 26 black cards; hence the probability of a red card in this 
deck is 26/52 = 1/2. The deck also contains 13 cards of each suit, 
so that the probability of a heart, say, is 13/52 = 1/4. This is also 
the probability of a diamond, or a spade, or a club. 
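
The definition of probability as the relative frequency m/t in a finite set lends itself to a direct computation. The following is a minimal sketch using Python's `fractions` module; the representation of a card as a (rank, suit) pair and the helper names `probability`, `is_red`, and `is_heart` are assumptions made for illustration.

```python
from fractions import Fraction

SUITS = ("hearts", "diamonds", "spades", "clubs")
RANKS = ("A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K")

# The finite set of t = 52 objects: one (rank, suit) pair per card.
deck = [(rank, suit) for suit in SUITS for rank in RANKS]

def probability(objects, has_property):
    """Return m/t: the objects having the property over all objects in the set."""
    m = sum(1 for obj in objects if has_property(obj))
    return Fraction(m, len(objects))

def is_red(card):
    return card[1] in ("hearts", "diamonds")

def is_heart(card):
    return card[1] == "hearts"

print(probability(deck, is_red))    # 1/2
print(probability(deck, is_heart))  # 1/4
```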

Description of Fundamental Probability Set. It should be 
especially noted that in defining a probability the set of objects 
to which it pertains must be precisely designated and the 
property of an object to which the probability refers must be 
carefully distinguished. For example, the probability of an ace in 
a pinochle deck¹ is 8/48 = 1/6 and not 4/52 = 1/13, as it is in an ordinary 
deck. Furthermore, for the same set of cards, the probability 
of a card of a given color is not the same as the probability of a 
card of a given suit or of a card of a given value. What is more 
important is that each of these properties and hence their 
probabilities pertain to a different classification of the objects of the 
set. As will be seen later, it is possible to add probabilities 
pertaining to the same classification of the objects of a given set, 
but not probabilities pertaining to different classifications, even 
though the set of objects is the same. A set of objects classified 
in a given way is called a “fundamental probability set.” In 
all calculations it is very important to define carefully the 
fundamental probability set that is involved. 

In this connection it should be noted that the “probability of 
a heart in an ordinary deck of cards” is not necessarily the same 
thing as the “probability of drawing a heart from the deck.” 


Foundations of the Theory of Probabilities,” Philosophy of Science (1934), 
Vol. 1, pp. 50-70. 

1 A pinochle deck consists of 2 aces, 2 kings, 2 queens, 2 jacks, 2 tens, and 
2 nines of each suit. There are no cards of lower value. 

For the former, the fundamental probability set is precisely 
designated; it is simply the given deck of cards classified 
according to suit. The total number in the deck may be readily 
counted; the hearts may be easily separated from the others; 
and their relative frequency, i.e., their probability, may be 
directly computed. But what is the fundamental probability 
set to which the “probability of drawing a heart from the deck” 
pertains? To this there are several answers. 

Suppose that 100 drawings are made from the given deck, 
the card drawn each time being replaced in the deck and the 
whole well shuffled before the next drawing. Let the number 
of hearts so drawn be 20. Here the fundamental probability 
set is the set of 100 drawings classified according to suit, and 
the probability of a heart in this set is 20/100 = 1/5. In this case 
also, the total number of objects can be counted and the 
number having the given property can be readily ascertained. 

The “probability of drawing a heart from the deck” may, 
however, pertain to a set of 100 drawings to be made in the 
future. Here the total number of “objects” in the set is given, 
but there is no way of ascertaining how many of these drawings 
will yield hearts. In this case, the “probability of drawing a 
heart from the deck” is simply unknown. 

Finally, the “probability of drawing a heart from the deck” 
may pertain to a set of hypothetical drawings, not actual 
drawings. If, it may be argued, 100 drawings should be made from 
the deck in the prescribed manner and if 30 of these should be 
hearts, then the probability of a heart in this assumed set would 
be 30/100 = 3/10. The “probability of drawing a heart from the deck” 
refers in this case to a hypothetical set. 

Infinite Probability Sets. Frequently, probability theory is 
concerned with an infinite set of objects. These are usually 
hypothetical sets but may in some cases be real sets, such as the 
infinity of points on a line. Without going into mathematical 
refinements, it may be said that the probability of an object 
of a given property in an infinite set is the percentage of such 
objects in the set. For the percentage of a particular kind of 
object in an infinite set may be finite even if the number of 
objects of the given property and the total number of objects 
are both infinite. For example, if a coin is tossed indefinitely, 
both the total number of tossings and the number of tossings 
yielding heads may be increased without limit. Nevertheless, 
the ratio of the number of heads to the total number of tossings 
will stay within finite limits no matter how many tossings are 
made. For an infinite set, therefore, as well as for a finite set, 
the probability of an object having a particular property is the 
relative frequency of such objects in the given set.¹ 


PROBABILITY AND THE RELATIVE FREQUENCY 
OF ACTUAL EVENTS 


In concluding this chapter a few words should be said about 
the relationship between mathematical probability and the 
relative frequency of actual events. As defined above, probability 
is a constant characterizing a given set of objects; it is 
merely a mathematical abstraction. If the theory of prob- 
ability is to be of any practical use, however, it must be tied to 
the relative frequency of actual events. It must help, in other 
words, in making predictions about real life. 

The Law of Large Numbers. The link that ties mathematical 
probability to the relative frequency of real events is actual 
experience with mass phenomena. This experience has been 
called the “law of large numbers,” which says that, when a large 
number of random events is involved, it is usually possible to 
predict, with reasonable accuracy, the relative frequency of 
occurrence of a particular event by calculating a certain mathe- 
matical probability. To illustrate, consider once again an 
ordinary deck of playing cards. Mathematically, this can be 
looked upon as a set of 52 objects for which the probability of a 
heart is 13/52 = 1/4. Let a large number of drawings, say 1,000, be 
made from this deck, the card drawn each time being replaced 
and the deck well shuffled before the next drawing. As already 
pointed out, no exact statement about the number of hearts 
drawn can be made in advance of the drawings. Experience 
shows, however, that in random drawings of this kind the relative 
frequency of hearts drawn approximates fairly well the mathe- 
matical probability of a heart in the deck. Hence, in the given 
instance it may be predicted that of the 1,000 random drawings 
something close to 250 will be hearts. 
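
The prediction just described can be imitated with simulated drawings. The sketch below is an assumption-laden stand-in for real card drawing: it uses Python's pseudo-random generator in place of a well-shuffled deck, and the function name and the `seed` parameter are hypothetical. No particular outcome is claimed; the point is only that the relative frequency tends to lie near 1/4 as the number of drawings grows.

```python
import random

SUITS = ("hearts", "diamonds", "spades", "clubs")
# Only the suit of each of the 52 cards matters for this classification.
DECK = [suit for suit in SUITS for _ in range(13)]

def relative_frequency_of_hearts(n_draws: int, seed: int = 0) -> float:
    """Draw n_draws cards with replacement and return the share that are hearts."""
    rng = random.Random(seed)  # fixed seed so that a run can be repeated
    hearts = sum(1 for _ in range(n_draws) if rng.choice(DECK) == "hearts")
    return hearts / n_draws

for n in (100, 1_000, 10_000):
    print(n, relative_frequency_of_hearts(n))
```

Changing `seed` gives a different set of drawings; the individual results vary, but for large numbers of drawings they cluster around the mathematical probability 0.25.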


1 For a more refined definition of probability, see Neyman, op. cit., pp. 
10-11. 

The foregoing is a very simple illustration of the law of large 
numbers. The law appears to be equally valid, however, for 
more complicated calculations of probability. For example, 
suppose there are two decks of cards, one an ordinary deck and 
the other a pinochle deck, and suppose that all possible 
combinations of two cards are made by combining one card from the 
ordinary deck with one card from the pinochle deck. Since the 
first card can be picked in 52 ways and the second in 48 ways, 
there will be 52 × 48 = 2,496 such combinations. Of these 
2,496 combinations, 4 × 8 = 32 will be pairs of aces; hence, in 
this set of combinations, 32/2,496 = 1/78 is the probability of a 
pair of aces. Now let a very large number of drawings be made 
from each deck of cards, the card drawn each time being replaced 
and the deck well shuffled before the next drawing. Furthermore, 
let the first card drawn from the ordinary deck be paired 
with the first card drawn from the pinochle deck, the second 
card from the ordinary deck with the second card from the 
pinochle deck, etc. Then, if the number of random drawings 
is very large, experience shows that the pairs of aces actually 
occurring in this large set of drawings will be close to 1/78 times 
the total number of drawings. Again the relative frequency of 
actual events can be approximately predicted by the computation 
of a mathematical probability. In fact, if random mass 
phenomena are involved, the whole of the calculus of probability can 
be employed in the prediction of relative frequencies with 
satisfactory accuracy. 
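
The count of 2,496 pairs and 32 pairs of aces can be confirmed by building the set of pairs explicitly. This is a sketch only; the card representation and the variable names are assumptions, and the pinochle deck is built from the description given in the footnote (two copies of ace, king, queen, jack, ten, and nine in each suit).

```python
from fractions import Fraction
from itertools import product

SUITS = ("hearts", "diamonds", "spades", "clubs")
ORDINARY_RANKS = ("A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K")
PINOCHLE_RANKS = ("A", "K", "Q", "J", "10", "9")

ordinary_deck = [(rank, suit) for suit in SUITS for rank in ORDINARY_RANKS]
# Two copies of every pinochle card give 48 cards in all.
pinochle_deck = [(rank, suit) for suit in SUITS
                 for rank in PINOCHLE_RANKS for _copy in range(2)]

# Every pairing of one ordinary card with one pinochle card.
pairs = list(product(ordinary_deck, pinochle_deck))
ace_pairs = [pair for pair in pairs if pair[0][0] == "A" and pair[1][0] == "A"]

p_ace_pair = Fraction(len(ace_pairs), len(pairs))
print(len(pairs))      # 2496
print(len(ace_pairs))  # 32
print(p_ace_pair)      # 1/78
```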

Empirically Determined Probabilities. It might be pointed 
out in passing that in many instances the original set of objects 
is not completely known and the probability of a given property 
of the set must be determined empirically. For example, the 
total number of deaths in the United States of white males, age 
fifty, is not completely known. Indeed, so far as we know, 
deaths of men, age fifty, will continue to occur indefinitely. Thus 
of the total number of men who have reached and will reach the 
age of fifty, the number who have died or will die during their 
fiftieth year is not precisely known. On the basis of the law of 
large numbers, however, it seems safe to assume that the many 
vital statistics that have been accumulated give a very close 
approximation to the true probability of death at age fifty. That 


1 Cf. Chap. X. 

this assumption is justified is again verified by actual experience. 
Thus, if the empirically determined probability of a man dying 
at age fifty and the empirically determined probability of his wife 
dying at age fifty are used to calculate¹ the probability of both 
a man and his wife dying at the age of fifty, experience with large 
masses of data shows that the relative frequency of such pairs 
of deaths at age fifty does actually approximate the calculated 
probability. The calculus of probability can thus be used by 
life-insurance companies with general success. Similar results 
have been found true of other empirically determined 
probabilities. The law of large numbers thus appears to be universally 
valid.² 

“Randomness.” It will be noted that the law of large numbers 
applies only to mass phenomena that are “random.” This is 
very important. If it happened, for example, that, in drawing 
pairs of cards from an ordinary deck and a pinochle deck, some 
method of selection were used that caused aces to appear in some 
cyclical order, say an ace on every tenth draw from the ordinary 
deck and on every fifth draw from the pinochle deck, then the 
relative number of pairs of aces occurring would not equal the 
computed mathematical probability. For, in this case, pairs 
of aces would occur on every tenth draw, and the probability 
of a pair of aces in the infinite set of drawings would be 1/10 and 
not 1/78, as computed above. 
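
The effect of a cyclical, nonrandom method of selection can be worked out by simple counting. The sketch below is illustrative only; the function name and its parameters are hypothetical, and a finite run of drawings stands in for the infinite set mentioned in the text.

```python
from fractions import Fraction

def cyclic_ace_pair_frequency(n_draws: int,
                              ordinary_period: int = 10,
                              pinochle_period: int = 5) -> Fraction:
    """Relative frequency of ace pairs when aces appear on a fixed cycle.

    An ace comes up on every `ordinary_period`-th draw from the ordinary
    deck and on every `pinochle_period`-th draw from the pinochle deck.
    """
    pairs = 0
    for draw in range(1, n_draws + 1):
        ordinary_ace = draw % ordinary_period == 0
        pinochle_ace = draw % pinochle_period == 0
        if ordinary_ace and pinochle_ace:
            pairs += 1
    return Fraction(pairs, n_draws)

print(cyclic_ace_pair_frequency(1_000))  # 1/10
print(Fraction(4, 52) * Fraction(8, 48))  # 1/78, the value computed for random draws
```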

“Randomness” cannot be exactly defined. Fundamentally, 
it is an intuitive concept. General notions suggest that to be 
random the occurrence of an event must be related in no way to 
its property; e.g., the drawing of an ace must be unrelated to its 
being an ace. Nor must a random series of events show any 
relationship between the members of the series. In other words, 
events must occur in complete disorder; they must be unpre- 
dictable by any formula. But, after all, these are negative 


1 See Chap. X. 

2 The association between mathematical probability and the relative 
frequency of real events is not essentially different from the association 
between mathematical models in other sciences and happenings in the real 
world. In physics, for example, the closeness of the association is good 
enough to enable mathematical formulas to be used in the construction of 
bridges, automobiles, and the like. In other words, the justification of the 
theory is that it works. 

criteria. The positive content of randomness must be left 
undefined.¹ 


APPENDIX 
A REVIEW OF THREE IMPORTANT CONCEPTS OF PROBABILITY 


As pointed out in the main body of this chapter, various concepts of 
probability are admissible. Altogether there appear to be three principal 
concepts that have contended for acceptance by scientists and philosophers. 
These may be described as the “classical concept,” the “frequency concept,” 
and the “intuitive-axiomatic approach” to probability. It is this last 
that is used in this book. Since it is an outgrowth of the conflict between 
the other two lines of thought, they will be discussed first. 

Classical Concept of Probability. Historical Background. Although 
commercial insurance was practiced by the Babylonians and was well known 
to the Greeks and Romans, the development of a theory of probability, 
such as that on which modern insurance practice is based, dates back only 
to the seventeenth century. Furthermore, it was not in the field of business 
that the seeds of this probability theory were sown, but in the gambling 
rooms of the French gentry. In 1654, Antoine Gombaud, chevalier de 
Méré, a French gentleman with an interest in mathematics, called upon the 
French mathematician Pascal for the solution of a particular gambling 
problem. The ensuing mathematical speculation marked the beginning 
of the investigation of games of chance. Subsequently there appeared 
various works by Huygens (1657), Jacques Bernoulli (1713), De Moivre 
(1718), and Bayes (1764), most of which were concerned with the applica- 
tion of the theory of permutations and combinations to the calculation of 
probabilities associated with various dice and card games. 

Meanwhile, French and English experimentalists, mathematical physi- 
cists, and astronomers were concerning themselves with errors of measure- 
ments. Simpson (1757) examined the implications of taking the mean of 
a set of astronomical measurements as the best estimate of the true value, 
and Lagrange (1770) published a memoir dealing with the “probable 
error” of the mean.² Other names associated with the early development 
of the theory of errors are Boscovich, Lambert, Euler, Daniel Bernoulli, 
and Legendre.³ The development of such concepts as “inverse probability” 
and probability of “causes” also led at this time to growing philosophical 
speculations on the theory of probability. Furthermore, the collection of 
mortality statistics led to the computation of mortality tables and the 
development of actuarial science. 

All these investigations—the analysis of gambling games, the formulation 
of a theory of errors, and philosophical speculation—reached their culmina- 
tion in the great work of Laplace, Théorie analytique des probabilités (1812). 


1 See Smith and Duncan, Sampling Statistics, pp. 155-162, for a discussion 
of various methods employed to get random samples. 

2 Cf. Levy, H., and L. Roth, Elements of Probability (1936), pp. 5-6. 

3 Cf. Nagel, E., Principles of the Theory of Probability, p. 10. 

This master synthesis contains all the essentials of the classical theory of 
probability and most of the important deductions from it. From the time 
of Laplace, developments of probability theory in the fields of philosophy; 
logic; mathematics; physical, chemical, and biological research; and the 
social and industrial arts and sciences were all bound to react on each other 
and to build on the same broad foundation. In the sense that he thus 
fused together the various lines of development, Laplace may be looked upon 
as the formulator of the classical theory of probability. 

The Classical Concept. The definition of probability given by Laplace 
and generally adopted by disciples of the classical school runs as follows: 
Probability, it is said, is the ratio of the number of “favorable” cases to the 
total number of equally likely cases. For example, if a coin is tossed, there 
are two equally likely results, a head or a tail; hence the probability of a 
head is 1/2. If a die is thrown, there are six equally likely results, and the 
probability of any particular one of these results, say a five, is therefore 1/6. 
Again, if a die is thrown, the probability of obtaining an even number is 
3/6, for three of the six equally possible results are even numbers. This 
example illustrates how the classical theory derived the addition 
theorem. For it will be noted that the probability of getting a particular 
one of the even numbers is in each case 1/6. But it has just been shown that 
the probability of any even number is 3/6 = 1/6 + 1/6 + 1/6; hence, the theorem 
follows that the probability of any one of a number of mutually exclusive 
events is the sum of their individual probabilities. 

Still another example will illustrate how the classical concept led to the 
multiplication theorem. Suppose three coins are tossed. Since either 
one of two results on the first coin can be combined with either one of two 
results on the second coin and any one of these combinations can be 
combined with either one of two results on the third coin, there are altogether 
2 × 2 × 2 = 8 equally possible results. The number of these eight possible 
combinations that have all three heads is 1. Hence, the probability of all 
three heads on the tossing of three coins is 1/8. But this is the same as 
the product of the individual probabilities of a head on each coin, i.e., 
(1/2)(1/2)(1/2) = 1/8. In general, the probability of the joint occurrence of 
independent events² is the product of their individual probabilities. These are 
all illustrations of how the classical theory of probability, in line with its 
definition of the term, sought in every case to resolve a problem into a set 
of equally likely cases and then by the application of combination formulas 
to determine the number of “favorable” cases. 
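
Both classical theorems can be checked by enumerating the equally likely cases, as the classical school itself would have done. The following is a minimal sketch; the helper name `classical_probability` is hypothetical.

```python
from fractions import Fraction
from itertools import product

def classical_probability(cases, is_favorable):
    """Ratio of favorable cases to the total number of equally likely cases."""
    cases = list(cases)
    favorable = sum(1 for case in cases if is_favorable(case))
    return Fraction(favorable, len(cases))

# Addition theorem: an even number on one throw of a die.
die = range(1, 7)
p_even = classical_probability(die, lambda face: face % 2 == 0)
p_each_even = [classical_probability(die, lambda face, f=f: face == f) for f in (2, 4, 6)]
print(p_even, sum(p_each_even))  # 1/2 1/2

# Multiplication theorem: three heads on three coins.
tosses = product("HT", repeat=3)  # the 2 x 2 x 2 = 8 equally possible results
p_three_heads = classical_probability(tosses, lambda t: t == ("H", "H", "H"))
print(p_three_heads, Fraction(1, 2) ** 3)  # 1/8 1/8
```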

Criticism of the Classical Concept. There is little criticism of the theo- 
rems built up by the calculus of probabilities on the basis of the classical 
definition insofar as they represent merely a set of logical relationships. 
Generally, the same set of relationships can be demonstrated on the basis 
of other definitions of probability. Criticism of the classical concept centers 


1 Cf. Levy and Roth, op. cit., p. 8. 

2 The independence of the individual events is necessary for this theorem 
to hold true. In the given example, the probability of getting a head on 
any one coin is independent of the results obtained on the other two. 

rather in the meaning of the results obtained and the adequacy of the theory 
for handling problems outside the field of gambling games, as in the 
statistical analysis of physical, biological, and economic data. 

Meaning of “Equally Likely.” The principal line of attack on the 
classical concept is directed against the terms “equally likely cases.” 
What does “equally likely” mean, it is asked. Is not “equally likely” 
merely another way of saying “equally probable,” and in that case is not 
the classical definition of probability a circular one, since it defines 
probability in terms of itself? To avert criticism, some rule must be laid down for 
the determination of “equally likely” or “equally probable” that is 
independent of “probable.”¹ What then were the rules of the classicists for 
determining equal likelihood? 

In the development of the classical theory, two procedures were offered 
for the determination of “equally likely” cases. One was the principle 
of sufficient reason and the other the principle of indifference, or the principle 
of the equal distribution of ignorance, as it was sometimes called. The first 
procedure was followed when a person examined all available evidence 
relevant to the event in question and noted that this evidence was symmetri- 
cal with reference to the various possible results. For example, after a 
thorough examination of a die, including a nice determination of its center 
of gravity and the moments of inertia about various sides, it might be con- 
cluded that the investigator had sufficient reason to consider the die perfectly 
symmetrical and hence the six possible results equally likely. According 
to the principle of indifference, on the other hand, if the investigator knew 
nothing about the die in question, he had no basis for deeming one side of 
the die to be different from any other and could therefore assume them all 
to be equally likely. This second procedure is subject to particularly 
severe criticism and will be discussed first. 

Principle of Indifference. Total ignorance about a thing, it may reason- 
ably be argued, can scarcely be a source of any knowledge concerning it, 
even of that uncertain kind afforded by a probability statement. In other 
words, how can something be got out of nothing? As might be expected, 
the use of the principle of indifference has frequently led to paradoxical 
results. This has been generally true whenever the set of “equally likely” 
cases was not discrete but was represented by a continuous variable. Sup- 
pose, for example, it is known that the weight of a certain man is at least 
equal to that of his wife but is not more than double her weight. If ignorance 
as to the exact ratio of weights is “evenly distributed,” it may be 
concluded that any ratio of the man’s to the woman’s weight lying between 
1 and 2 is as likely as any other within that interval. From this it follows 
that the probability of its lying between 1 and 1.5 is 50 per cent and that the 
probability of its lying between 1.5 and 2 is also 50 per cent. Suppose, 
however, the ratio of the woman’s to the man’s weight had been taken as 
the variable. The limits would then be 0.5 and 1, and according to the 
principle of indifference all possible values of the ratio lying within this range 
might be deemed equally likely. Then, however, it may be concluded that 


1 Cf. NAGEL, op. cit., p. 46.

the probability of the ratio of the woman’s weight to the man’s weight lying
between 0.5 and 0.75 is just 50 per cent and that the probability of its lying
between 0.75 and 1 is also 50 per cent. But this second result is in disagreement
with the first, for a ratio of the woman’s to the man’s weight
equal to 0.75 is the equivalent of a ratio of the man’s weight to the woman’s
weight equal to 1.33. Thus, according to the second method of distributing
ignorance, the probability is 50 per cent that the man is 1 to 1⅓ times as
heavy as the woman; and, according to the first method, the probability
is 50 per cent that he is 1 to 1½ times as heavy. In general, when the
principle of indifference is employed to determine what is equally likely,
a change in the coordinates used to describe a continuous variation in
possible results frequently affects the values of the computed probabilities.¹
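
The disagreement between the two methods can be checked with exact arithmetic. The following is a minimal sketch, not part of the original argument: the helper `prob_uniform` is a hypothetical name, and the event chosen for comparison (the man being at most 1.5 times as heavy as his wife) is selected only for illustration.

```python
from fractions import Fraction

def prob_uniform(lo, hi, a, b):
    """P(a <= X <= b) when X is evenly distributed over [lo, hi]."""
    a, b = max(a, lo), min(b, hi)
    return max(b - a, Fraction(0)) / (hi - lo)

# Event of interest: the man is at most 1.5 times as heavy as his wife.
threshold = Fraction(3, 2)

# First method: ignorance spread evenly over r = man/woman on [1, 2].
p_first = prob_uniform(Fraction(1), Fraction(2), Fraction(1), threshold)

# Second method: ignorance spread evenly over s = woman/man on [1/2, 1].
# The event r <= 3/2 is the same event as s >= 2/3.
p_second = prob_uniform(Fraction(1, 2), Fraction(1), 1 / threshold, Fraction(1))

print(p_first)   # 1/2
print(p_second)  # 2/3
```

The same event receives a probability of 1/2 under one choice of variable and 2/3 under the other, which is the paradox described above.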

Principle of Sufficient Reason. The first method of procedure, viz., that 
of sufficient reason, puts the theory on a more solid basis. It has been 
criticized, however, in respect to its practical applicability. Even after
the symmetry of a die has been carefully determined, it is still necessary to
note any lack of bias in the method of throwing it or again in the surface
on which it rolls. To determine these a priori is a matter more difficult
than determining the symmetry of the die itself. Much greater, however, is the difficulty
of determining equal likelihood when attention is turned from dice, coins,
and cards to the phenomena of the scientific laboratories and of everyday
life. An insurance company insures the lives of a thousand men; how can
it determine a priori whether they are all equally likely to die during the
year? Some men are tall, others short; some are fat, some thin; some work
outdoors, others indoors. Is there any possibility of the insurance company
telling, other than from its actual experience with men of various
classes, i.e., a posteriori, whether these individual differences destroy the
equal likelihood of death? Critics of the classical theory answer with
an emphatic “no.” It is easily seen, they say, why the classical theory
developed out of a study of games of chance, for it is in that field alone that
there is any reasonable possibility of determining a priori whether a set of
possible results are “equally likely.” In most other fields it is impossible.²

But why insist on a rational a priori determination of equal likelihood, 
the reader may ask. Why not determine it a posteriori? For example, to 
determine whether a head or a tail is equally likely for a given coin, why 
not toss the coin a large number of times and note whether the number of 
heads is approximately equal to the number of tails? This sounds simple
enough until it is studied more closely. Then the question immediately
arises: How good an approximation is necessary (how close to 0.5 must the
ratio of heads to total tosses be) before it can be concluded that a head and a tail
are equally likely? On the assumption of equal likelihood, the classical
theory itself explains that in a finite number of tosses any given result is
possible, although of course all results are not equally probable. If N
tosses are made, for example, there are $2^N$ equally possible results,* and of

1 Cf. VON MISES, RICHARD, Probability, Statistics, and Truth (1939), pp.
114-115.

2 Cf. VON MISES, op. cit., pp. 98-110.

* Cf. p. 274.

these the number that would have r heads would be the number of combinations
of N things taken r at a time,¹ or $C_r^N = \frac{N!}{r!\,(N-r)!}$. Hence, any
value of r/N would be possible with a probability of $\frac{N!}{r!\,(N-r)!}\left(\frac{1}{2}\right)^N$. If
N is large, the probability that r/N should deviate considerably from 0.5 
is very small and the general reasonableness of the hypothesis of equal 
likelihood could be tested by determining the probability of as large a 
deviation from 0.5 as is actually found.²
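
The test just described can be sketched numerically. This is an illustration under stated assumptions rather than a prescribed procedure: the function names are hypothetical, and the observed result of 60 heads in 100 tosses is invented for the example.

```python
from math import comb

def prob_heads(r, n):
    """Probability of exactly r heads in n tosses, assuming equal likelihood."""
    return comb(n, r) / 2 ** n

def prob_deviation_at_least(r_observed, n):
    """Probability that r/n deviates from 0.5 by at least the observed amount."""
    d = abs(r_observed - n / 2)
    return sum(prob_heads(r, n) for r in range(n + 1) if abs(r - n / 2) >= d)

print(prob_heads(5, 10))                  # 0.24609375
print(prob_deviation_at_least(60, 100))   # a little under 6 per cent
```

A small probability casts doubt on the hypothesis of equal likelihood but, as the following paragraphs stress, it never disproves it.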

The empirical results therefore do not provide a certainty that a head 
or a tail is equally likely. There is no way of telling for sure whether the 
deviation from an exact value of 0.5 is due merely to chance or whether the 
coin is actually biased. A similar result might have been produced by a 
biased coin. In fact, the latter might on occasion produce an exactly equal 
number of heads and tails, so that, even if the ratio of r to N is exactly 0.5, 
it is not certain that the coin is perfectly unbiased. If, as suggested above, 
the hypothesis of equal likelihood is accepted or rejected on the basis of the 
probability of getting the given deviation from 0.5, then equal probability 
is being determined in a way that is dependent on the concept of probability 
and the criticism of circularity in the definition becomes immediately valid.³

A still further criticism is this: Suppose that after a careful examination 
of a die it is found that the die is not symmetrical, what then? If a die is 
biased, how can the problem be resolved into a set of equally likely cases? 
True, if it can be determined that, through careful weighing, balancing, 
rotating, etc., the occurrence of an even number is twice as likely as the 
occurrence of an odd number, it might be argued that there are nine equally 
likely results, one of which is a one, two of which are twos, one of which is a 
three, two of which are fours, one of which is a five, and two of which are 
sixes, and the various combinatorial formulas might be based on this assump- 
tion. Even if the possibility of making such an a priori determination is 
not questioned, there still remains the problem of how to treat the case where 
the bias is such that a given result or results is 1.5 or 3.67 or z times as 
likely as some other result. Laplace himself attempted a solution of this 
problem but failed to obtain a correct answer.⁴ It would seem that the
problem is insoluble on the basis of the classical concept.⁵

Subjective Character of Classical Concept. Since the foregoing criticisms 
have in a way implied that “probability” was more or less objective in
character, in all fairness to the classical theory it should be pointed out 
that Laplace and most of his followers took a subjective view of the concept. 

1 Cf. p. 234.

2 Cf. pp. 248-249 for a further discussion of this. 

3 The “frequency” concept expounded below does not suffer from this 
criticism because, according to this concept, probability is defined as the 
limit that the ratio of heads to total results approaches as the number of 
tosses is indefinitely increased. No circularity arises from the need of 
determining equal probability. See pp. 250-251. 

4 von Mises, op. cit., p. 102.

5 Cf. NAGEL, op. cit., p. 45.

Probability to them was a “rational degree of belief.” They considered
the word “probability” as meaning a state of mind regarding a given
statement, a future event, or any other thing about which absolute knowledge
was not to be had.¹ It was not made clear by the classicists, however,
just how subjective their concept of probability was. If it were a
mere measure of degree of (psychological) belief, the theory of proba- 
bility became a part of the science of psychology and immediately the 
question arose as to how degrees of belief could be added and multiplied, 
as called for by the probability calculus. On the other hand, if “rational 
degree of belief” was to be interpreted as what every intelligent person 
ought to believe under the given circumstances, then the theory assumed 
a certain degree of objectivity and all the foregoing criticisms became 
applicable with respect to the exact content of this standard of “oughtness.”²
It is the contention of the critics of the classical concept that, if probability
theory is to be of practical use in statistical science, some objective definition
of probability should be adopted. One writer points out that physical
thermodynamics had its starting point in the subjective impressions of
hot and cold but that its development began when an objective method was
used to compare temperatures by means of a column of mercury.³ In the
same way, he concludes, probability should be put upon a physical,
objective basis. 

Frequency Concept of Probability. If the reader will toss an ordinary 
coin a large number of times, he will see that the ratio of the number of
heads to the total number of tosses will be close to 0.5000, and approximately 
the same result will be obtained each time the experiment is repeated. This 
empirical fact, that in mass phenomena the relative frequency of a given 
attribute often appears to approximate a definite constant, is the corner- 
stone of the frequency concept of probability. The constant value that 
the relative frequency tends to approximate is identified with the “proba- 
bility” of the given attribute. 

This frequency approach to the theory of probability goes back at least 
to the work of J. Venn on the Logic of Chance published in London in 1886. 
In the present day, a leading exponent of this view is Richard von Mises, 
whose writings⁴ constitute one of the most important formulations of the
frequency theory of probability. The next section will be devoted to an
exposition of his ideas. A subsequent section will discuss the “intuitive-axiomatic”
approach, which is similar to von Mises’ theory but differs
somewhat in its logical basis.

Concept of von Mises. In von Mises’ theory, probability is defined only 
with reference to what he calls a “collective.” This is an infinitely large 
set of “random” elements that possess certain specified characteristics. 


1 Cf. NAGEL, op. cit., p. 44.

2 Cf. NAGEL, op. cit., p. 46.

3 VON MISES, op. cit., p. 112.

4 The most important of these are Wahrscheinlichkeitsrechnung (Leipzig,
1931) and Probability, Statistics, and Truth (1939), the latter being a translation
of an earlier (1928) German edition.

The sequence of results obtained from an indefinite tossing of a coin or
throwing of a die, the set of human births running back into the indefinite
past and projected into the endless future, the sequence of parts turned
out by the continuous operation of a given manufacturing process¹ are all
examples of collectives. If the elements of such sequences take
on varying attributes, such as heads or tails, male babies and female babies, 
acceptable and unacceptable parts, the first essential characteristic of a true 
collective is that the relative frequency with which a particular attribute 
occurs shall approach a fixed limit as the number of elements in the collective 
is indefinitely increased. Mathematically this means the following: If r/N 
is the relative frequency of a given attribute among N elements, e.g., the 
number of heads among N tossings of a coin, and if p is the limit that 
r/N approaches as N is increased indefinitely, then after some point, say 
N = 1,000,000, the difference between r/N and p becomes, and thereafter 
remains, less than an arbitrarily chosen positive quantity ε, say 0.005.
Numerically it means that if r/N is calculated to a given number of decimal
places, as N is increased a point is finally reached after which further 
increases bring no change in the calculated figures. For example, if the 
ratio of heads to total tossings is calculated to three decimals, then, after 
some number of tossings, this ratio will always give r/N = 0.500. The 
limit that the relative frequency of a given attribute approaches as the 
number of elements of a collective is indefinitely increased is defined as the 
probability of that attribute in the given collective. 
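
The behavior of r/N as N grows can be illustrated by simulation. The sketch below rests on assumptions not in the text: a pseudo-random number generator stands in for the coin, and the seed and checkpoints are arbitrary. A finite run can only suggest the stabilizing of the ratio; it cannot establish the existence of a limit.

```python
import random

def running_relative_frequency(n_tosses, seed=0):
    """Simulate tosses of a fair coin and return r/N at selected values of N."""
    rng = random.Random(seed)          # fixed seed so the run is repeatable
    checkpoints = {10, 100, 1_000, 10_000, 100_000}
    heads = 0
    result = {}
    for n in range(1, n_tosses + 1):
        heads += rng.random() < 0.5    # True counts as one head
        if n in checkpoints:
            result[n] = heads / n
    return result

freqs = running_relative_frequency(100_000)
for n, f in freqs.items():
    print(n, round(f, 3))
```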

The second characteristic of a true collective is its “randomness.” Thus 
the sequence of elements constituting a collective must be free from any 
regularity; they must be in complete disorder. It is to be noted that the 
relative frequency of an attribute may approach a limit in a given sequence 
without that sequence being a random one. If, for example, some special 
apparatus were constructed so that every fifth tossing of a coin resulted in a 
head and every other tossing in a tail, the sequence of results would look 
as follows: 


TTTTHTTTTHTTTTHTTTTHTTTTHTTTTHTTTTHTTTTH .... 


The limit of relative frequency of heads in such a sequence would be 1/5, but
the sequence is obviously not a random one. It is consequently not a true
collective, and it cannot be said that the probability of a head under the
given conditions is 1/5. Actually the probability of a head on every fifth
tossing is 1, and on every other tossing it is 0.

What precisely constitutes “randomness”? von Mises’ answer to this 
fundamental question is as follows: If subsequences of elements are picked 
from the original sequence in such a way that the selection of a particular 


1 In reality, manufacturing processes change materially from time to time. 
What is envisaged here is one that remains exactly the same indefinitely. 
For a discussion of the statistical aspects of manufacturing processes, see 
W. A. Shewhart, Statistical Method from the Viewpoint of Quality Control 
(Graduate School of the United States Department of Agriculture, 
Washington, 1939).

element is independent of the attribute assumed by that element and if in
all possible subsequences of this kind the limit of relative frequency of a
given attribute is the same as in the original sequence, the latter may be
said to be random. In the truly random tossing of a coin, for example, the
selection of every fifth tossing would yield a subsequence of tossings in
which the relative frequency of heads would approach the same limit (1/2
if the coin is unbiased) as in the complete sequence. If such a method of
selection were applied to the particular sequence of heads and tails given
above, the result would obviously be a subsequence consisting of all heads,
for which the limit of relative frequency would be 1 and not 1/5 as in the
original sequence. This sequence clearly fails to meet the test of randomness.
Another example of randomness is provided by the game of roulette.
If the results of the game are influenced solely by chance forces, i.e., if they 
constitute a truly random sequence, there is no way of placing bets so as to 
secure better than average results; no formula can be devised to “beat the 
house.” As von Mises puts it, the existence of randomness means the 
impossibility of devising a gambling “system.” 
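
The test of randomness by place selection can be tried on the regular sequence in which every fifth tossing is a head. The following is a small sketch; the function name and the finite length of 1,000 tossings are choices made for the illustration, since an infinite sequence cannot be written down.

```python
from fractions import Fraction

def relative_frequency(seq, attribute="H"):
    """Relative frequency of the given attribute in a finite sequence."""
    return Fraction(seq.count(attribute), len(seq))

sequence = "TTTTH" * 200          # every fifth tossing is a head
every_fifth = sequence[4::5]      # place selection: tossings 5, 10, 15, ...
odd_places = sequence[0::2]       # place selection: tossings 1, 3, 5, ...

print(relative_frequency(sequence))     # 1/5
print(relative_frequency(every_fifth))  # 1
print(relative_frequency(odd_places))   # 1/5
```

One place selection (the odd-numbered tossings) leaves the relative frequency unchanged, but another (every fifth tossing) changes it from 1/5 to 1. A single selection of the second kind is enough to show that the sequence is not random in von Mises’ sense.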

In summary then, a true collective is a mass phenomenon or an endless
sequence of observations for which (1) the relative frequencies of the particular
attributes of the elements of the collective tend to fixed limits and
(2) these fixed limits are the same for any place selection of a subsequence,
i.e., any selection that depends only on the location of an element in the collective
and not on the attribute it assumes. The existence of such a collective
is the fundamental postulate of von Mises’ theory of probability.

Criticism of von Mises’ Concept. Since the initial formulation of his 
theory in 1919,¹ von Mises’ concept of probability has been the subject of
considerable discussion. Some have sought to refine and elaborate von
Mises’ views; others have contended that they contain serious logical
inconsistencies.² Here only a brief mention will be made of these criticisms.

Since in real life only finite series can be observed, there has been some
objection to a concept of probability based upon the notion of infinite
sequences. One writer³ has attempted to work out von Mises’ ideas, using
finite series, but the complications are great and the results are not so
comprehensive. After all, the concept of an infinite series only aims to
give approximate results. It has been a useful mathematical tool in many
other sciences; so why not in probability?⁴

Some writers have attacked the existence of limiting values. For 
example, in accordance with the classical theory of probability, if an unbiased 
coin is tossed N times, there is always a probability, however small, that the 


1 See Mathematische Zeitschrift, Vol. 5.

2 See list of references in von Mises’ Probability, Statistics, and Truth, pp.
316-318, references 35-51.

3 BLUME, JOHANNES, Zur axiomatischen Grundlegung der Wahrscheinlichkeitsrechnung,
1934 (Dissertation Münster); Zeitschrift für Physik, Vol. 92
(1934), pp. 232-252; Vol. 94 (1935), pp. 192-203.

4 Cf. von Mises, Probability, Statistics, and Truth, pp. 121-122.

5 Cf. Fry, T. C., Probability and Its Engineering Uses, pp. 88-91.

coin will turn up heads in all, or in a large proportion of, the N tossings.
It is consequently argued that, whatever point is selected in the infinite
sequence of tossings, there is always the possibility that the next N tossings
will turn up such a large proportion of heads that the ratio of heads to total
tossings will differ from the supposed limit p by more than the arbitrarily
selected quantity ε and hence contradict the mathematical criterion for the
existence of a limit.¹ The answer to this criticism is that it is based upon
another (the classical) concept of probability and is a proposition concerning 
the possible results obtained from a finite number of tossings. It is not 
in contradiction to another proposition that begins with a different view of 
probability and postulates the existence of a limit in an infinite sequence of 
tossings.²

More serious criticism of von Mises’ theory has been directed against the 
condition of randomness. There is the question, for example, whether a 
series that is in complete disorder and cannot accordingly be described by a
mathematical formula can logically be conceived to exist. Still further,
there is the question whether limits to relative frequencies in an infinite
sequence can coexist with von Mises’ definition of randomness. Recent
mathematical investigations, however, appear to have resolved this difficulty.
It is claimed that, with a more carefully drawn and slightly less
comprehensive definition of randomness, the type of collective described by
von Mises can for all practical purposes be conceived to exist.³


THE INTUITIVE-AXIOMATIC APPROACH TO PROBABILITY 


The theory of probability presented in the main body of this text is based 
upon the intuitive notion that theorems derived from axioms relating to 
relative frequencies approach satisfactorily the occurrence of events in 
real life. It may therefore be called the “intuitive-axiomatic” approach
to probability. It is the concept of probability accepted by such men as
Neyman, Fréchet, and Kolmogoroff.⁴ Lying between the classical concept
of Laplace and the pure frequency theory of von Mises, it may be called 
a “compromise” concept. 


1 See p. 248.

2 Cf. VON MISES, Probability, Statistics, and Truth, pp. 126-128, and the
whole of the fourth lecture. Also, see NAGEL, op. cit., p. 37.

3 The fundamental papers supporting this claim are those of A. H. Cope- 
land, American Journal of Mathematics, Vol. 50, pp. 535-552; Vol. 51, pp. 
612-618; Vol. 53, pp. 153-162; and Vol. 58, pp. 181-192, and a paper by 
A. Wald, Ergebnisse eines mathematischen Kolloquiums, Wien, No. 8, pp. 
38-72. See, however, the recent criticism of Maurice Fréchet, Journal of 
Unified Science, Vol. 8, pp. 1-22. He refers to an example by Ville in 
which a given set is so defined that it meets all the conditions laid down by 
Wald and yet contains the regularity that the relative frequency always
converges to its limit by values that are greater than p. He admits, how- 
ever, that von Mises does not feel that this creates any difficulty in his 
theory. 

4 See footnote (2), p. 236.

The intuitive-axiomatic approach to probability differs from the classical
theory in that it avoids the circularity of “equal likelihood.” It merely
defines probability as a relative frequency of a certain attribute in a given
set of objects without any statement as to whether these are “equally
likely.” It consequently avoids introducing any subjective elements into
the definition of probability.¹

On the other hand, the intuitive-axiomatic approach differs from von
Mises’ approach in that it does not identify probability with the mathe- 
matical limit approached by the relative frequency of a given attribute in 
an infinite random sequence of actual events. It may indeed take the 
relative frequency of a certain attribute in a hypothetical infinite set as the 
probability of that attribute, but this need not be the mathematical limit 
approached by the relative frequency of any actual type of event. It 
merely says that, if relative frequencies of random mass data are treated as 
if they were mathematical probabilities of some hypothetical infinite set, 
then the calculations derived on this assumption will be satisfactorily close 
to the relative frequencies of various combinations of these data. “Ran- 
domness” is left as an undefined, intuitive concept. 


1 The subjective elements and even the concept of equal likelihood enter
to some extent, however, in the determination of “randomness” (see pp. 241-


CHAPTER IX 
PROBABILITY DISTRIBUTIONS 


A description of a fundamental probability set giving the 
various categories, or classes, into which the members of the set 
are grouped, together with the probabilities of each group, is 
called a probability distribution. The property of a member of a 
given group may be spoken of as an “attribute,” and a prob- 
ability distribution may be said to show how the total probability 
is distributed among the various attributes. Since the attributes 
of a fundamental probability set are necessarily mutually exclu- 
sive and since a member of a set must possess some one of the 
given attributes, it follows that the total probability, i.e., the 
sum of the probabilities of the various attributes, must equal 1. 
In other words, the percentages (probabilities) of the cases falling 
in the various groups must add up to 100 per cent. 

For example, the distribution of probability among the four 
suits of an ordinary playing card deck is 

    Spades      1/4 or 25 per cent
    Hearts      1/4 or 25 per cent
    Diamonds    1/4 or 25 per cent
    Clubs       1/4 or 25 per cent

The quality of being a spade, heart, diamond, or club is the 
attribute of a card. Since a card cannot be both a spade and a 
heart, say, these attributes are mutually exclusive; and since a 
card must belong to one of the four suits, the total probability is 1. 

Similarly, the distribution of probability among the six faces
of an ordinary die is

    1    1/6
    2    1/6
    3    1/6
    4    1/6
    5    1/6
    6    1/6

The quality of being a 1, 2, 3, 4, 5, or 6 is the attribute of a face.
Since a face of a die cannot be both a 2 and a 6, say, these attri- 
butes are mutually exclusive, and since a face must have one 
of the markings listed above, the total probability is again 1. 
Discrete Probability Distributions. When the attributes of a 
set are qualitative in character, such as spades or hearts in the 
case of a deck of cards or heads or tails in the case of coins, or 
when they are represented by a set of numerical values that do 
not vary continuously, such as the number of spots on the face 
of a die, the distribution of probability is said to be “discrete.”
If the attributes are represented by points on a horizontal axis 
and their probabilities measured along a vertical axis, a discrete 


[Fig. 77. Distribution of probability of heads and tails on a coin.]
[Fig. 78. Distribution of probability of the spots on the faces of a die.]


probability distribution may be pictured by a series of lines or 
bars as in Figs. 77 and 78. It will be noted that it is the height 
of the bar in each case that measures the probability of the attribute
at which it is erected.

Continuous Probability Distributions. If the members of a 
set consist of the numerical figures obtained by the repeated 
measurement of the length of a given table or the continued 
measurement of the heights of adult white males, living now and 
in the future, the attributes assumed by the members may form a 
continuous variable. In such a case, the total probability of 
1 can be considered to be distributed over the whole range
of variation; it will thus form a "continuous" distribution of 
probability. More exactly, the range may be divided into small 
class intervals, and location within one of these intervals may be 
taken as the attribute of a member of the set. In this instance, 
the probability of a member belonging to a given interval may be
represented by the area of a rectangle erected over that interval,
and the total distribution may be pictured as a set of such rectangles
in the manner shown in Fig. 79. If, now, the class intervals
are made smaller and smaller, the tops of these rectangles
will tend to sketch out a smooth curve (cf. Fig. 80). A probability
curve of this sort can be looked upon as the limit
approached as the class intervals into which the range is
divided are made infinitesimally small.

[Fig. 79. A continuous distribution of probability. Horizontal axis: inches.]

Frequency Distributions as Probability Distributions. From the
definition of probability given above, it follows
that any frequency distribution in which the frequencies are
expressed as a percentage of the total number of cases is a
distribution of probability of the given set of cases. It likewise
follows that a frequency curve that represents the distribution
of relative frequency of an infinite population of cases¹ is also a
probability curve.

[Fig. 80. A probability curve. Horizontal axis: inches.]
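
The representation of probability by the areas of rectangles can be sketched numerically. Everything specific here is an assumption made for illustration: the heights are supposed to follow a normal curve with a mean of 68 inches and a standard deviation of 2.5 inches, and the range 58 to 78 inches and the three class widths are arbitrary.

```python
from math import erf, sqrt

MEAN, SD = 68.0, 2.5   # hypothetical heights in inches (assumed, not from the text)

def cdf(x):
    """Cumulative probability up to x under the assumed normal curve."""
    return 0.5 * (1 + erf((x - MEAN) / (SD * sqrt(2))))

def rectangles(lo, hi, width):
    """Return (left edge, height) pairs; height * width is the class probability."""
    n = round((hi - lo) / width)
    edges = [lo + i * width for i in range(n + 1)]
    return [(a, (cdf(b) - cdf(a)) / width) for a, b in zip(edges, edges[1:])]

for width in (4.0, 1.0, 0.25):
    rects = rectangles(58.0, 78.0, width)
    total_area = sum(h * width for _, h in rects)
    print(width, len(rects), round(total_area, 4))
```

However narrow the class intervals are made, the total area of the rectangles stays the same (very nearly 1 over this range), while their tops trace the curve more and more closely.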


Since a distribution of relative frequency and a distribution of
probability are thus one and the same thing, all the measures of
the various characteristics of frequency distributions automatically
apply to probability distributions. Thus a probability
distribution has a mean, a standard deviation, a coefficient of 
skewness, and a coefficient of kurtosis, like any frequency 
distribution. 
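
As an illustration, these measures may be computed for the distribution of probability among the faces of a die. The sketch assumes moment-based definitions of skewness and kurtosis (third and fourth moments about the mean, divided by the appropriate power of the standard deviation); other definitions of these coefficients exist, and the function name is hypothetical.

```python
from fractions import Fraction
from math import sqrt

# Probability distribution of the faces of an ordinary die.
die = {face: Fraction(1, 6) for face in range(1, 7)}

def central_moment(dist, k):
    """k-th moment about the mean of a discrete probability distribution."""
    mean = sum(x * p for x, p in dist.items())
    return sum((x - mean) ** k * p for x, p in dist.items())

total = sum(die.values())                       # total probability
mean = sum(x * p for x, p in die.items())
variance = central_moment(die, 2)
std_dev = sqrt(variance)
skewness = central_moment(die, 3) / std_dev ** 3     # assumed definition
kurtosis = central_moment(die, 4) / variance ** 2    # assumed definition

print(total, mean, variance)   # 1 7/2 35/12
print(skewness, kurtosis)      # 0.0 303/175
```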


1 See p. 238.

ALGEBRAIC AND GRAPHIC REPRESENTATION OF THE NORMAL
FREQUENCY CURVE


Functional Relationships. Before entering upon a mathe- 
matical and graphic representation of frequency and probability 
curves, it may be well to review briefly the algebraic and geo- 
metric description of simple functional relationships. This is 
the purpose of the present section. 

If one quantity varies when a second varies, the first is said to 
be a “function” of the second. The pressure of air in an auto- 
mobile tire, for example, varies with the temperature; pressure is 
thus a function of temperature. Again, the quantity of butter 
bought varies with its price; hence, the purchase of butter is a 
function of its price. 

Functional relationships of this kind are described symbolically 
by such expressions as y = f(x), y = F(x), y = G(x), y = φ(x),
and y = ψ(x), all of which are to be read “y is a function of x”
or, more specifically, “y is the f function of x,” “y is the F function
of x,” etc. The expressions y = f(x) and y = F(x) are the
most common; the others are often used, however, when a
problem involves more than one functional relationship. For
example, if y and z are both functions of x, this may be expressed
by y = f(x) and z = g(x).

Frequently a quantity varies, not merely with one, but with a 
number of other quantities. The former may then be said to be 
a “joint function” of the latter. Thus the volume of gas in a
tube is a function of the pressure and the temperature (the combined
gas law); the quantity of butter bought is a function of the price of
butter and the income of its purchasers. Joint functional relationships
of this kind are expressed by y = f(x,z), y = F(x,z),
y = φ(x,z), etc., all to be read “y is a function of x and z.”

Explicit and Implicit Functions. The functional relationships
so far considered are “explicit” functions. In explicit functions 
one variable is selected as the dependent variable, and the other 
or others as the independent variables; this is indicated by writ- 
ing the dependent variable to the left of the equal sign. Often, 
however, it is convenient to talk of two variables as being functionally
related without indicating which is to be taken as
dependent and which as independent variable. Such a functional
relationship is indicated by f(x,y) = 0, F(x,y) = 0,
φ(x,y) = 0, etc., or if there are more than two variables, by
f(x,y,z) = 0, F(x,y,z) = 0, φ(x,y,z) = 0, etc. These all mean
that x and y or x, y, and z are “functionally related.” Functions
of this kind are called “implicit” functions. An explicit function
can often (although not always) be derived from an implicit
function by merely solving the latter for the variable selected as
dependent.

The simplest type of functional relationship is expressed by a 
polynomial. A polynomial in x means such expressions as x
(strictly speaking a monomial), a + x, a + bx, and a + bx + cx².
The “degree” of the polynomial is the highest power of x that
occurs in the expression. Thus a + bx is a polynomial of the
first degree; a + bx + cx² and a + cx² are polynomials of the
second degree; and a + bx + cx² + dx³, a + bx + dx³,
a + cx² + dx³, and a + dx³ are polynomials of the third degree.
Polynomials in two variables, say x and z, are illustrated by
a + bx + gz, a + bx + cx² + gz + hz², a + bx + gz + mxz, and
a + bx + kz³, the first being of the first degree, the second and
third of the second degree, and the last of the third degree.
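
The rule for the degree of a polynomial in two variables, under which a product term such as mxz counts as second degree, can be sketched as follows. The dictionary representation and the numerical coefficients are assumptions chosen only for the illustration.

```python
def degree(terms):
    """Degree of a polynomial given as {(power of x, power of z): coefficient}.

    The degree of a term is the sum of its powers; the degree of the
    polynomial is the largest such sum among terms with nonzero coefficient.
    """
    return max(px + pz for (px, pz), coef in terms.items() if coef != 0)

# a + bx + gz + mxz, with arbitrary nonzero coefficients
second_degree = {(0, 0): 2.0, (1, 0): 3.0, (0, 1): -1.0, (1, 1): 0.5}
print(degree(second_degree))  # 2

# a + bx + kz^3
third_degree = {(0, 0): 1.0, (1, 0): 4.0, (0, 3): 2.0}
print(degree(third_degree))   # 3
```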

If y varies by a constant absolute amount every time x varies
by a fixed given amount, the function that expresses this relationship
is a first-degree polynomial in x, such as y = a + bx.
For every time x increases (or decreases) by one unit, y increases
(or decreases) by b units. Thus, if y = 10 + 2x, y increases (or
decreases) by two units every time x increases (or decreases) by
one unit. If b is negative, then the variation in y is opposite in
direction to that of x. For example, if y = 10 − 2x, then y
decreases (or increases) by two units every time x increases (or
decreases) by one unit. The quantity a is the value of y when
x equals 0; if y = 0 when x = 0, then a must be zero.

When a functional relationship can be represented in this way 
by a first-degree polynomial, such as y = a + bx, then y is said
to be a “linear” function of x. This continues to hold true
when there is more than one independent variable. Thus
y = a + bx + gz is said to express y as a linear function of
x and z. If the change in y that accompanies a given change in x
varies with the value of x, then the functional relationship can
no longer be expressed by a first-degree polynomial in x. The
function is in this case “nonlinear,” and a higher degree polynomial
or some more complex expression must be employed to
indicate the relationship. A nonlinear relationship between y
and two or more variables, such as x and z, must also be expressed
by some function other than a first-degree polynomial in these
two variables.
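
The distinction between linear and nonlinear functions can be seen by tabulating the change in y for successive unit changes in x. In this sketch the linear function is y = 10 − 2x from the discussion above; the second-degree function y = 10 + 2x + x² and the helper name are assumptions added for contrast.

```python
def first_differences(f, xs):
    """Changes in y between successive values of x."""
    ys = [f(x) for x in xs]
    return [b - a for a, b in zip(ys, ys[1:])]

xs = range(0, 6)
linear = first_differences(lambda x: 10 - 2 * x, xs)
quadratic = first_differences(lambda x: 10 + 2 * x + x ** 2, xs)

print(linear)     # [-2, -2, -2, -2, -2]
print(quadratic)  # [3, 5, 7, 9, 11]
```

For the linear function the change in y is the same at every value of x; for the second-degree function it varies with x.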

Graphs of Simple Functions. A pair of values, x and y, may
be represented by a point in a plane. The point P in Fig. 81,
for example, represents the value x = x₀ and y = y₀; i.e., the
coordinates of P are (x₀, y₀).

[Fig. 81. Plotting of a point.]
[Fig. 82. Graph of a straight line.]

First-degree Polynomials. The graph of a first-degree polynomial
is a straight line (hence the name “linear” relationship),
and conversely every straight line may be represented algebraically
by a polynomial of the first degree. The simplest way to
comprehend this relationship between the algebraic and the
geometric presentation is to think of a straight line in reference
to (1) the angle it makes with the x-axis and (2) the intercept
it cuts on the y-axis. Thus, in Fig. 82, let θ represent the angle
CAB that the line makes with the x-axis at A. The tangent of
this angle is the slope of the line AB. Let b represent this slope;
then b = tan θ. It is evident that a straight line is determined
when its slope b and its y intercept, a (or OB), are found, as follows:
Let P (whose coordinates are any x, y) be a representative point
of the line. Take PC, the perpendicular to Ox, from P, and
BD, the perpendicular to PC, from B. Then,


\[ \tan DBP = \frac{DP}{BD} \]
or
\[ DP = \tan DBP \times BD \tag{1} \]
But
\[ DP = CP - CD = CP - OB = y - a \]
\[ BD = OC = x \quad \text{and} \quad \tan DBP = \tan CAP \quad \text{(similar triangles)} \]
and
\[ \tan CAP = \tan \theta = b \]

Now, since DP = y - a, and tan DBP = tan θ = b, and BD = x,
then, upon substituting in Eq. (1),

\[ y - a = bx \quad \text{or} \quad y = a + bx \tag{2} \]


A straight line thus represents y as equal to a polynomial in x
of the first degree. When the line passes through the origin,
a = 0 and the functional relationship becomes simply y = bx.
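The relation between the angle, the slope, and Eq. (2) may be sketched as follows. The function name `line_from_angle` and the numerical values are assumptions made for illustration only.

```python
import math


def line_from_angle(theta_degrees, a):
    """Return the slope b = tan(theta) and the function y = a + b*x."""
    b = math.tan(math.radians(theta_degrees))
    return b, (lambda x: a + b * x)


# A line making a 45-degree angle with the x-axis and cutting the y-axis at 3.
intercept = 3.0
b, f = line_from_angle(45.0, intercept)

# For any point (x, y) on the line, (y - a) / x recovers the slope b,
# which is Eq. (1) read backwards: DP / BD = tan DBP.
recovered = [(f(x) - intercept) / x for x in (1.0, 2.0, 5.0)]
print(b, recovered)
```

Whatever point of the line is chosen, the ratio (y - a)/x is the same, which is why the slope and the intercept together determine the line.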

Second-degree Polynomials. If the functional relationship
between x and y takes the form of a second-degree polynomial,
such as \( y = a + bx + cx^2 \), its graph will be a parabola. By
definition, a parabola is the locus of all points equidistant from a
fixed point called the “focus” and a fixed line called the “directrix.”
In Fig. 83, the focus is the point (F), and the directrix is the line
RS (y + F = 0). Take any point on the parabola, \( P(x', y') \), and
draw from it a line perpendicular to RS, at point M; then draw
a line from the focus (F) to the point P.

Fig. 83.—Graph of a parabola.

At the point x = 0, y = 0, it is obvious that the parabola is
equidistant from the directrix y + F = 0, or y = -F, and the
focus at x = 0, y = F.

Now, if a line were drawn from P perpendicular to the y-axis,
at C, it is clear that \( FP^2 = (y' - F)^2 + (x')^2 \), since \( FC = y' - F \)
and \( PC = x' \). This is true since the square of the hypotenuse is
equal to the sum of the squares of the other two sides of a
right-angled triangle. Furthermore, it can be seen from Fig. 83 that
\( MP^2 = (y' + F)^2 \) because MP is \( y' + F \). By hypothesis,
\( MP^2 = FP^2 \), and hence, by substitution,

\[ (y' + F)^2 = (y' - F)^2 + (x')^2 \]

which is true for any value of \( x' \) and \( y' \); that is, any x and y, and
which by transposition and simplification becomes

\[ x^2 = 4Fy \quad \text{or} \quad y = \frac{x^2}{4F} \]

This is of the general form of \( y = cx^2 \). If the curve had not
passed through the origin, its equation would have been of the
form \( y = a + cx^2 \); and if its vertex had come either to the right
or left of the y-axis, its equation would have been

\[ y = a + bx + cx^2 \tag{3} \]


If the value of c is negative, the curve turns down as in Fig. 84
instead of up as in Fig. 83. These parabolic curves thus illustrate
the form of a functional relationship in which y is set equal
to a second-degree polynomial in x.
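The focus-directrix definition can be checked against the equation \( x^2 = 4Fy \). This is a small sketch: the value F = 2 and the function names are arbitrary choices for illustration.

```python
import math

F = 2.0  # assumed focal distance: focus at (0, F), directrix y = -F


def parabola_y(x, focal=F):
    """Ordinate of the point on the parabola x**2 = 4*F*y with abscissa x."""
    return x ** 2 / (4 * focal)


def distance_to_focus(x, y, focal=F):
    """Length FP, from the focus (0, F) to the point (x, y)."""
    return math.hypot(x, y - focal)


def distance_to_directrix(x, y, focal=F):
    """Length MP, the perpendicular distance from (x, y) to the line y = -F."""
    return abs(y + focal)


checks = []
for x in (-3.0, 0.0, 1.5, 4.0):
    y = parabola_y(x)
    checks.append((distance_to_focus(x, y), distance_to_directrix(x, y)))

for fp, mp in checks:
    print(f"FP = {fp:.6f}   MP = {mp:.6f}")
```

Each pair of distances agrees, as the hypothesis MP = FP requires.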


Fig. 84.—Graph of a parabola, \( y = -cx^2 \) (c > 0). Fig. 85.—Graph of a circle, \( x^2 + y^2 = r^2 \).


The Circle. The implicit functional relationship

\[ x^2 + y^2 - r^2 = 0 \tag{4} \]

is of interest in that its graph is a circle. By definition, a circle
is the locus of all points equidistant from a fixed point called the
“center.” In Fig. 85, the center is taken at the origin. By the
property of right-angle triangles, the square of the distance of
any point P from the center at O (the origin) is simply \( x^2 + y^2 \).
Since by definition this must be the same for all points on the
circle, the equation of the circle is simply \( x^2 + y^2 = r^2 \), where
r is the radius. By transposition, this becomes \( x^2 + y^2 - r^2 = 0 \).
If the center of the circle had been at the point (a, b) instead of
at the origin, its equation would have been

\[ (x - a)^2 + (y - b)^2 - r^2 = 0 \]



An implicit functional relationship, therefore, in which two 
variables enter to the second degree with identical coefficients 
but in which there is no cross-product term (such as xy) is simply 
the algebraic expression for a circle. 

The Ellipse. If the functional relationship represented by a
circle is modified so that the coefficients of the second-degree
terms are no longer identical (but still retain the same signs), its
geometric counterpart is distorted so as to become an ellipse.
Thus the graph of the implicit functional relationship

\[ ax^2 + by^2 - r^2 = 0 \tag{5} \]

is an ellipse whose semimajor axis is \( r/\sqrt{a} \) and whose semiminor
axis is \( r/\sqrt{b} \) (cf. Fig. 86). If b is less than a, then the ellipse
runs the other way, as in Fig. 87.


Fig. 86.—Graph of an ellipse, \( ax^2 + by^2 - r^2 = 0 \) (a < b). Fig. 87.—Graph of an ellipse, \( ax^2 + by^2 - r^2 = 0 \) (a > b).
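The semi-axes quoted for Eq. (5) can be verified numerically, and solving the same equation for y shows that each admissible x yields two ordinates. The coefficients below and the function names are illustrative assumptions; setting a = b = 1 gives the circle of Eq. (4).

```python
import math


def ellipse_semi_axes(a, b, r):
    """Half-lengths of a*x**2 + b*y**2 - r**2 = 0 along the x- and y-axes."""
    return r / math.sqrt(a), r / math.sqrt(b)


def ellipse_y_values(x, a, b, r):
    """Both ordinates for a given x, or an empty tuple if x lies outside."""
    radicand = (r ** 2 - a * x ** 2) / b
    if radicand < 0:
        return ()
    root = math.sqrt(radicand)
    return (root, -root)


a, b, r = 1.0, 4.0, 2.0  # illustrative coefficients with a < b, as in Fig. 86
along_x, along_y = ellipse_semi_axes(a, b, r)
upper, lower = ellipse_y_values(1.0, a, b, r)

print(along_x, along_y)
print(upper, lower)
```

With a < b the longer half-axis lies along x, in agreement with Fig. 86; interchanging the two coefficients turns the ellipse the other way.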


It will be noted that, if either of the implicit relationships
\( x^2 + y^2 - r^2 = 0 \) or \( ax^2 + by^2 - r^2 = 0 \) is solved for y, the
resulting solution gives y, not as a single-valued function of x, but
as a double-valued function. Thus, in the case of the circle,
\( y = \pm\sqrt{r^2 - x^2} \), which shows that, for each value of x, there are
two values of y. Likewise, for the ellipse, \( y = \pm\sqrt{(r^2 - ax^2)/b} \),
which again gives two values of y for each value of x. Geometrically
this means that a line perpendicular to the x-axis
cuts the circle and ellipse at two different points. In contrast,
the straight line and the parabola express y as a single-valued
function of x. The difference is due to the fact that, in the case
of the circle and ellipse, it is \( y^2 \) and not y that is expressed as a
polynomial in x and when the square root is taken two values of
y result.

Rising Exponential Function. A simple function that does not
express y or \( y^2 \) as a polynomial in x is the exponential, or
“compound-interest,” function. Suppose that a sum of $1 is put
out at interest at 6 per cent per year compounded for 3 years;
then the value of this sum at the end of the 3 years would be
as follows:


\[ \text{Value at end of 3 years} = (1 + 0.06)(1 + 0.06)(1 + 0.06) = (1 + 0.06)^3 \]

In general, if $1 is put out at interest for x years and compounded
at the rate of r per annum, its value at the end of the x years
would be as follows:

\[ \text{Value at end of } x \text{ years} = (1 + r)^x \]

If the interest is compounded every 6 months instead of every
year, then the value at the end of x years would be as follows:

\[ \text{Value at end of } x \text{ years} = \left(1 + \frac{r}{2}\right)^{2x} \]

for the interest rate for 6 months is just half the rate for a year
and the number of 6-month periods is just double the number of
years. If the interest is compounded every quarter, then the
value of the $1 at the end of x years would be \( (1 + r/4)^{4x} \); and
if the interest is compounded every 1/nth of a year, the value at
the end of x years would be \( (1 + r/n)^{nx} \), r always being the rate of
simple interest per year. This last may be written

\[ \left[\left(1 + \frac{r}{n}\right)^{n/r}\right]^{rx} \]

If, now, n is made infinitely large, in other words, if the period of
compounding (that is, 1/nth of a year) is made infinitesimally
small, so that the operation of compounding may be viewed as
practically continuous, then the quantity \( (1 + r/n)^{n/r} \) is equal
approximately to e, the base of the Napierian system of logarithms.
For, by definition, e is the limit of \( (1 + 1/m)^m \) as m
approaches infinity. The value of $1 at the end of x years when
interest at r per annum is compounded continuously is therefore
\( e^{rx} \).

The quantity e, it will be recalled, is a numerical constant
equal approximately¹ to 2.7183, so that the value of a sum
compounded continuously for x years is given by a function of
the form \( y = ae^{bx} \), where a and b are merely constants. The
compound-interest function is thus an “exponential” function,
for the variable x enters into the function as an exponent. A
graph of this function for a positive value of b (= r) is shown in
Fig. 88.

Fig. 88.—Graph of a rising exponential function, \( y = e^{rx} \) (r > 0).
Fig. 89.—Graph of a declining exponential function, \( y = e^{-rx} \) (r > 0).
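The passage from periodic to continuous compounding may be illustrated with a short computation. This is a sketch under the text's own example (6 per cent for 3 years); the function names and the particular values of n are assumptions chosen for illustration.

```python
import math


def compound_value(r, x, n):
    """Value of $1 after x years at annual rate r, compounded n times a year."""
    return (1 + r / n) ** (n * x)


def continuous_value(r, x):
    """Value of $1 after x years at annual rate r, compounded continuously."""
    return math.exp(r * x)


def e_estimate(m):
    """The quantity (1 + 1/m)**m, whose limit as m grows is e."""
    return (1 + 1 / m) ** m


r, x = 0.06, 3
annual = compound_value(r, x, 1)  # (1.06)**3
table = [(n, compound_value(r, x, n)) for n in (1, 2, 4, 12, 365, 1_000_000)]
limit = continuous_value(r, x)    # e**(0.06 * 3)

for n, value in table:
    print(f"n = {n:>9}: {value:.6f}")
print(f"continuous : {limit:.6f}")
print([round(e_estimate(m), 4) for m in (10, 50, 1_000, 10_000)])
```

The value rises with the frequency of compounding but is bounded by \( e^{rx} \), and the last line shows how slowly \( (1 + 1/m)^m \) itself approaches e.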

¹ This may be easily proved by putting increasingly large values of m in
the formula \( e = (1 + 1/m)^m \). Thus,
for m = 10, \( e = (1 + 1/10)^{10} = (1.1)^{10} = 2.594 \);
for m = 50, \( e = (1 + 1/50)^{50} = (1.02)^{50} = 2.691 \);
for m = 1,000, \( e = (1 + 1/1{,}000)^{1{,}000} = (1.001)^{1{,}000} = 2.7171 \);
for m = 10,000, \( e = (1 + 1/10{,}000)^{10{,}000} = (1.0001)^{10{,}000} = 2.7182 \), the
calculations being carried out by logarithms (see Appendix, Table I).

Declining Exponential Function. The “present value” of $1
due x years hence, interest being compounded continuously at
the rate of r per annum, would be equal to \( e^{-rx} \). For since the
sum of $1 would accumulate to \( e^{rx} \) dollars at the end of x years, the
sum of \( e^{-rx} \) (= \( 1/e^{rx} \)) dollars would accumulate to

\[ e^{-rx} \cdot e^{rx} = e^0 = 1 \text{ dollar} \]

at the end of that time, and hence the present value of $1 due x
years hence is \( e^{-rx} \). This “discount” function, as it may be
called, is thus a negative exponential function. Its graph is
shown in Fig. 89.
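The argument for the discount function can be retraced numerically: a sum equal to the present value, left to accumulate at the same rate, returns exactly $1. The rate, the horizon, and the function names below are illustrative assumptions.

```python
import math


def present_value(r, x):
    """Present value of $1 due x years hence, discounted continuously at rate r."""
    return math.exp(-r * x)


def accumulated_value(principal, r, x):
    """Value of a principal after x years of continuous compounding at rate r."""
    return principal * math.exp(r * x)


r, x = 0.06, 10  # illustrative rate and horizon
pv = present_value(r, x)
back_to_one = accumulated_value(pv, r, x)

print(f"present value: {pv:.6f}")
print(f"accumulated  : {back_to_one:.6f}")
```

The present value falls as the due date recedes, which is the declining shape drawn in Fig. 89.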

Whereas the exponential functions are not themselves polynomials,
it will be noted that the expression that constitutes the
“exponent of e” is a first-degree polynomial in x (more accurately,
the monomial x). This suggests that interesting modifications
of the exponential function might be obtained by
making the exponent of e a second or higher degree polynomial
in x. A function of this kind that is of very great importance
in the theory of frequency curves is \( y = e^{-x^2} \). Figure 90
shows how well this function represents the general shape of a
frequency distribution; as will be seen shortly, it is the kernel of
the formula for the normal frequency curve.

Fig. 90.—Bell-shaped (normal) curve.

Formula for the Normal Frequency and Probability Curves.
Formulas for frequency and probability curves may be written
in two ways. The first, which may be symbolized by y = φ(X),
merely expresses the ordinate of the curve y as a function of the
attribute X whose frequency is being measured. (φ is used here
to signify “function of,” instead of the usual F or f, in order to
avoid confusion with the employment of the latter to indicate
frequencies.) In this form the formula simply describes the
locus of the points that constitute the frequency or probability
curve.
The second method of writing a frequency or probability
formula is d(F/N) or dP = φ(X) dX. In this form, the probability
or relative frequency of a case lying between X and
X + dX is expressed as a function of the attribute X. It will
be recalled that a probability or frequency curve is the limit
approached by an area histogram as the class interval is made
infinitesimally small. The expression d(F/N) or dP = φ(X) dX
merely says that, when the class interval (of size dX)¹ is made
infinitesimally small, the area under the curve for any class
interval [that is, d(F/N) or dP] is approximately equal to the
area of a small rectangle whose base is dX and whose height is
the ordinate of the curve φ(X) at X (cf. Fig. 91). This second
method is, strictly speaking, the proper method for describing a
“probability” or “frequency” curve, since it is this and not the
first which actually expresses the probability or frequency of a given
class interval as a function of the attribute X. The former gives
merely an algebraic expression for the curve and not the area under
the curve.²

Fig. 91.—Graphical representation of a probability function. (The area under the curve, i.e., dP, is given approximately by the area of the rectangle on the base dX.)

¹ The letters dX, dP, d(F/N), etc., are to be read as a single symbol and
not as the product d and X or d and P, etc. The symbol dX means an
infinitesimally small part of the X range, the symbol dP means an infinitesimal
probability, and the symbol d(F/N) means an infinitesimal relative
frequency.

² The first method of description is, however, the only form that is
appropriate for a discrete distribution.

The Normal Curve. One curve occurs very often in statistical
analysis, especially in the theory of sampling. This is the normal
frequency curve. Its algebraic and graphic representation may
profitably be illustrated by a brief description.

The mathematical formula for the normal curve is

\[ dP = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{(X - \bar{X})^2}{2\sigma^2}} \, dX \tag{6} \]

where e (= 2.7183+) is the base of the Napierian system of
logarithms, \( \bar{X} \) is the mean of the distribution, and σ is its standard
deviation. Pictures of the curve are given in Figs. 92a and
92b. It will be noted that the curve is symmetrical and generally
bell-shaped.

Owing to its symmetry, the center of the curve comes at the
mean point \( X = \bar{X} \). Here the curve also reaches its greatest
height, viz., \( 1/(\sigma\sqrt{2\pi}) \), and from there it slopes gradually downward
on each side as the factor \( e^{-(X - \bar{X})^2/2\sigma^2} \) assumes greater and greater
significance.¹ The points of inflection, i.e., the points at which
the left and right branches change from concave to convex
downward, fall at \( X = \bar{X} \pm \sigma \). About two-thirds of the area
under the curve lies between these two points of inflection, and
about 95 per cent between \( \bar{X} - 2\sigma \) and \( \bar{X} + 2\sigma \). The average
deviation equals 0.7979 times the standard deviation, and the
fourth moment equals three times the square of the second
moment (hence \( \beta_2 = 3 \)).
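The properties just listed can be checked directly from Eq. (6). The sketch below uses only the Python standard library; the mean and standard deviation are arbitrary illustrative values, and the area between limits is obtained from the error function `math.erf` rather than from any table in the book.

```python
import math


def normal_ordinate(X, mean, sigma):
    """Height of the normal curve at X: the factor multiplying dX in Eq. (6)."""
    z = (X - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))


def area_within(k):
    """Area of any normal curve between mean - k*sigma and mean + k*sigma."""
    return math.erf(k / math.sqrt(2))


mean, sigma = 4.0, 1.5                          # illustrative values
peak = normal_ordinate(mean, mean, sigma)       # greatest height, at X = mean
left = normal_ordinate(mean - 1.0, mean, sigma)
right = normal_ordinate(mean + 1.0, mean, sigma)
one_sigma = area_within(1)                      # "about two-thirds"
two_sigma = area_within(2)                      # "about 95 per cent"
mean_deviation_ratio = math.sqrt(2 / math.pi)   # average deviation / sigma

print(f"peak height        : {peak:.4f}")
print(f"area within 1 sigma: {one_sigma:.4f}")
print(f"area within 2 sigma: {two_sigma:.4f}")
print(f"mean deviation     : {mean_deviation_ratio:.4f} sigma")
```

The two areas do not depend on the particular mean and standard deviation chosen, which anticipates the reduction of all normal curves to a single standard form.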


Fig. 92a.—Graphs of normal frequency curves with different means but same
standard deviations.


As will be noted from the formula, a particular normal curve
is determined when its mean \( \bar{X} \) and its standard deviation σ are
given. Different normal curves, then, will have either different
means or different standard deviations, or both. If the means
are different, the curves will have different positions on the
X-axis; if the standard deviations are different, the curves will be
of different widths. This is illustrated in Figs. 92a and 92b.

Although normal curves may thus differ in respect to their
means and standard deviations, they all possess the essential
“normal” form. This may be brought out by measuring the
attribute X, not in original units, such as pounds or dollars, but
as a deviation from the mean of X measured in standard-deviation
units, i.e., as \( (X - \bar{X})/\sigma \). Whenever this is done, all
normal curves become one and the same curve.

¹ When \( X = \bar{X} \), \( e^{-(X - \bar{X})^2/2\sigma^2} = 1/e^0 = 1 \). As X moves away from \( \bar{X} \) in either direction, \( e^{(X - \bar{X})^2/2\sigma^2} \)
becomes larger and larger and hence \( 1/e^{(X - \bar{X})^2/2\sigma^2} \) becomes smaller and smaller.
All the ordinates are positive since the exponent of e is squared.

Suppose, for example, that X in one case represents pounds
and in another quarts. In both cases the probability or relative
frequency is “normally” distributed, but the mean of the first
distribution is 4 pounds and its standard deviation is 1.5 pounds.
This distribution is represented by curve A in Fig. 92a. The
second distribution has a mean of 6 quarts and a standard
deviation of 0.5 quart; it is represented by curve D in Fig. 92b.
If, however, the unit adopted for the measurement of the attribute
of an element is in the first case \( (X - 4)/1.5 \) and in the
second case \( (X - 6)/0.5 \), then in terms of these \( (X - \bar{X})/\sigma = x/\sigma \)
units the two distributions will be identical. It is thus possible
to reduce all normal distributions to a standard normal form¹
(see Fig. 93).

Fig. 92b.—Graphs of normal frequency curves with same mean, but different
standard deviations.

¹ It will be noted that in the standard form the height of the middle
ordinate is \( 1/\sqrt{2\pi} = 0.3989 \).

Consider now the relationship between the mathematical
formula for the curve and this standard normal form. In the
individual or nonstandard form, the formula for the curve is
that given above, viz.,

\[ dP = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{(X - \bar{X})^2}{2\sigma^2}} \, dX \]

This says that the infinitesimal portion of the total area under
the curve dP cut off by the infinitesimal class interval X to X + dX
is approximately equal to the area of an infinitesimal rectangle whose
height is \( \frac{1}{\sigma\sqrt{2\pi}} e^{-(X - \bar{X})^2/2\sigma^2} \) and whose base is dX. If now the
attribute is expressed in \( (X - \bar{X})/\sigma = x/\sigma \) units, the mathematical
formula becomes

\[ dP = \frac{1}{\sqrt{2\pi}} \, e^{-\frac{1}{2}\left(\frac{x}{\sigma}\right)^2} \, d\!\left(\frac{x}{\sigma}\right) \]

In this case, the infinitesimal portion of the total area cut off
by the infinitesimal class interval \( x/\sigma \) to \( x/\sigma + d(x/\sigma) \) is expressed as
the area of a rectangle whose height is \( \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}(x/\sigma)^2} \) and whose
base is \( d(x/\sigma) \). In other words, the effect of measuring the attribute
in \( x/\sigma \) units instead of X units is to change the size of the
unit class interval on which the rectangle rests but at the same
time to change its height proportionately so its area dP remains
the same.

Fig. 93.—Graph of a standard normal curve.
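The reduction to standard units may be sketched with the two distributions of the example (mean 4 lb. with standard deviation 1.5 lb., and mean 6 qt. with standard deviation 0.5 qt.). The function names are illustrative, and areas are computed from `math.erf` rather than from the book's tables.

```python
import math


def standardize(X, mean, sigma):
    """Express X as a deviation from the mean in standard-deviation units."""
    return (X - mean) / sigma


def area_to_left(X, mean, sigma):
    """Area under the normal curve to the left of X."""
    return 0.5 * (1 + math.erf(standardize(X, mean, sigma) / math.sqrt(2)))


def area_between(lo, hi, mean, sigma):
    """Area under the normal curve between lo and hi."""
    return area_to_left(hi, mean, sigma) - area_to_left(lo, mean, sigma)


# From the mean to one standard deviation above it, in each system of units.
pounds = area_between(4.0, 5.5, 4.0, 1.5)    # curve A, original units
quarts = area_between(6.0, 6.5, 6.0, 0.5)    # curve D, original units
standard = area_between(0.0, 1.0, 0.0, 1.0)  # the same interval in x/sigma units

print(standardize(5.5, 4.0, 1.5), standardize(6.5, 6.0, 0.5))
print(pounds, quarts, standard)
```

The class interval is narrower for the quarts than for the pounds, but the curve is proportionately taller, so the area dP cut off is the same in both cases and in the standard form.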


A table giving the value of \( \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}(x/\sigma)^2} \) for various values of
\( x/\sigma \) will be found in the Appendix, Table VII. These are the
ordinates of the standard normal curve. It was by means of
this table that Fig. 93 was plotted; its use in “fitting” the normal
curve to an actual set of data will be explained in the next
chapter.¹


1 See pp. 277 and 295-304. 
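Ordinates of the standard normal curve of the kind tabulated may be computed as follows. This is only a sketch: the spacing of 0.5 between successive values of \( x/\sigma \) is an arbitrary choice and is not meant to reproduce the layout of Table VII, and the function names are illustrative.

```python
import math


def standard_ordinate(t):
    """Ordinate of the standard normal curve at t = x/sigma."""
    return math.exp(-0.5 * t * t) / math.sqrt(2 * math.pi)


def fitted_ordinate(X, mean, sigma):
    """Ordinate of a particular normal curve, obtained from the standard one."""
    return standard_ordinate((X - mean) / sigma) / sigma


rows = [(step / 2, standard_ordinate(step / 2)) for step in range(0, 7)]
for t, height in rows:
    print(f"{t:3.1f}  {height:.4f}")
```

Dividing a standard ordinate by σ, as in `fitted_ordinate`, gives the height of the curve of Eq. (6) at the corresponding X.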


CHAPTER X 
PROBABILITY CALCULUS 


Two Fundamental Theorems. There are two fundamental 
theorems in the calculus of probabilities. These are the addition 
theorem and the multiplication theorem. There is also a special 
form of the latter that pertains only to “independent” attributes. 

Addition Theorem. The addition theorem pertains to the
summation of the probabilities of one and the same probability
set. Since the several attributes of a given probability set are
necessarily mutually exclusive, it follows that the relative frequency
of cases having either one of two attributes is the sum
of the relative frequencies of the separate attributes. For
example, there are 13 hearts in an ordinary deck of 52 playing
cards, and also 13 diamonds. The relative frequency of a heart
is therefore 13/52, and the relative frequency of a diamond is also 13/52.
The relative frequency of a red card, i.e., either a heart or a
diamond, is 26/52, which equals 13/52 + 13/52. Since the relative
frequencies of the attributes are by definition their probabilities, it
follows that the probability of a member of a set having either
one of several attributes is simply the sum of the individual
probabilities. The probability of a heart or a diamond is thus
13/52 + 13/52 = 26/52 = 1/2. This addition theorem is valid for infinite
probability sets as well as for finite sets.
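The card example can be reproduced by counting. The sketch below assumes a standard deck of four 13-card suits; the names `deck` and `probability` are illustrative.

```python
from fractions import Fraction

# Assumption: a standard 52-card deck with four suits of 13 ranks each.
SUITS = ("hearts", "diamonds", "clubs", "spades")
deck = [(rank, suit) for suit in SUITS for rank in range(1, 14)]


def probability(attribute):
    """Relative frequency of cards in the deck for which attribute(card) is true."""
    favourable = sum(1 for card in deck if attribute(card))
    return Fraction(favourable, len(deck))


p_heart = probability(lambda card: card[1] == "hearts")
p_diamond = probability(lambda card: card[1] == "diamonds")
p_red = probability(lambda card: card[1] in ("hearts", "diamonds"))

print(p_heart, p_diamond, p_red)
```

Because no card is both a heart and a diamond, counting the red cards directly gives the same result as adding the two separate probabilities.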

Algebraically the addition theorem may be expressed as
follows: If the attributes of a given set are \( X_1, X_2, \ldots, X_n \)
(representing either qualitative or quantitative characteristics)
and their probabilities are \( p_1, p_2, \ldots, p_n \), then the probability of
\( X_1 \), \( X_2 \), or \( X_3 \), say, that is, the attribute of being any one of these
X's, is simply \( p_1 + p_2 + p_3 \). If the variation in attributes
within the set is continuous and if the distribution of probability
is described by a formula such as dP = φ(X) dX, then the probability
of an attribute within any one of a number of small
ranges dX whose sum constitutes the range \( X_1 \) to \( X_2 \) is given
by Σ dP = Σ φ(X) dX, or, in the symbolism of the integral
calculus, ∫ dP = ∫ φ(X) dX. The significance of this theorem
will become clearer when its application to particular problems is
considered.
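For a continuous distribution the summation can be imitated by adding many narrow rectangles φ(X) dX. In this sketch φ is taken to be the standard normal density purely as an example, and the number of rectangles is an arbitrary choice; the exact area used for comparison comes from `math.erf`.

```python
import math


def phi(X, mean=0.0, sigma=1.0):
    """Normal density, used here only as an example of dP = phi(X) dX."""
    z = (X - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))


def summed_probability(lo, hi, steps=100_000):
    """Add the rectangles phi(X) dX over the small ranges that make up lo..hi."""
    dX = (hi - lo) / steps
    return sum(phi(lo + (i + 0.5) * dX) * dX for i in range(steps))


approx = summed_probability(-1.0, 1.0)
exact = math.erf(1 / math.sqrt(2))  # the integral of phi from -1 to 1

print(f"sum of rectangles: {approx:.8f}")
print(f"integral         : {exact:.8f}")
```

As the ranges dX are made smaller the sum approaches the integral, which is what the integral sign in the text abbreviates.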

The addition theorem is sometimes stated as follows: The probability
of either one of two mutually exclusive events (attributes)
is the sum of their individual probabilities. This version of the
theorem is perfectly valid if it is understood that the two attributes
belong to the same set. It will be recalled that all the
attributes of a set are mutually exclusive.¹ It is not true, however,
that all mutually exclusive attributes belong to the same
set. To illustrate this point von Mises gives the following
example:²

Suppose that the probability of a man dying between his
fortieth and forty-first birthdays is 0.011 and the probability of
his marrying between his forty-first and forty-second birthdays
is 0.009. These events are mutually exclusive, but it cannot
be said that the probability of a man either dying in his fortieth
year or marrying in his forty-first year is 0.011 + 0.009 = 0.02.
The two events do not belong to the same set, and the addition
theorem can be validly applied only to attributes of one and the
same set.

Multiplication Theorem. The multiplication theorem pertains
to the calculation of a probability of a “derived” or “second-order”
probability set from the probabilities of two or more
“first-order” sets. Consider, for example, an ordinary deck of
playing cards and a pinochle deck. Each may be said to constitute
a “first-order” probability set. In the first set the probability
of an ace is 4/52 = 1/13, and in the second the probability of
an ace is 8/48 = 1/6. Let the third set be formed from these two
first-order sets by combining each card of one first-order set with
each card of the other first-order set. Furthermore, let the
attribute of any pair be the values of the cards making up the
pair, such as a king and a nine, an ace and a queen, a two and a
ten, and the like. The probability of any one attribute in this
second-order set, say the probability of a pair consisting of two
aces, is the relative frequency of such pairs among all possible
pairs that might be formed. It is the purpose of the multiplication
theorem to give a general rule by which a probability of a
second-order set of this kind can be computed from the probabilities
of the first-order sets.

¹ See p. 252.
² Probability, Statistics, and Truth, p. 54.

In deriving the multiplication theorem, two cases must be
distinguished, one pertaining to independent probabilities and
the other to dependent probabilities. Consider first the case of
independent probabilities. Suppose that in finding all the various
pairs of cards that might be composed of one card from each
deck there is no limitation on how the cards might be matched.
Then each card of the ordinary deck will have associated with it
every card of the pinochle deck. Hence the set of pinochle cards
that are paired with the ace of spades, say, from the ordinary
deck will be the same as the set of pinochle cards paired with the
jack of diamonds, say, or in fact with any other card from the
ordinary deck.

Since the set of pinochle cards paired with each card of the
ordinary deck is thus the same as the set of pinochle cards paired
with every other card from the ordinary deck and since this
common set is identical with the original pinochle set, it follows
that the probability of any given pinochle card in the set paired
with any given card of the ordinary deck is the same as the
probability of that pinochle card in the original pinochle deck.
This being the case, the probability of a card from the pinochle
deck is said to be independent of the attribute of the card from
the ordinary deck.

In general, the probabilities of set II are said to be independent 
of the attributes of probability set I if, when the members of the 
two sets are paired together, the probability of an attribute in 
the subset of members of set II paired with any given member of 
set I is the same as the probability of that attribute in the 
original set II. Symbolically, P(B) is independent of A if P(B/A), 
that is, the probability of B given A, equals P(B), that is, the 
probability of B. 

Consider once again the given example. Since each card of the
ordinary deck is associated with each card of the pinochle deck,
the total number of different¹ pairs of cards that can be formed is
52 × 48, for there are 52 ways in which the first card can be
picked and 48 ways in which the second card can be picked.
Likewise, the number of combinations that would consist of two
aces would be 4 × 8 = 32. Hence the probability of a pair of
aces in the whole set of pairs of cards is 32/2,496 = 1/78. But
from the calculations this is seen to be equal to 1/13 × 1/6 = 1/78.
In other words, the probability of a pair of aces in the second-order
probability set is equal to the probability of an ace in the
ordinary deck times the probability of an ace in the pinochle
deck.

¹ Different in the sense that the cards going to make up any pair are not
precisely the same cards as those making up any other pair. This means
that the two aces of spades, the two kings of spades, the two jacks of diamonds,
etc., in the pinochle deck must be considered as different cards, even
though their value and suit are the same.

The multiplication theorem for independent probabilities may
thus be stated as follows: If p is the probability of a member
of set I having the attribute A and if q is the probability of a
member of set II having the attribute B, and if q is independent
of the attribute assumed by a member of set I, then the probability
of a member of the derived set I-II having the attribute
AB, that is, the probability of a pair AB among all possible pairs
of two elements from each of the two sets I and II, is the product
of the probabilities p and q. In simpler form, if P(B) is independent
of A, the joint probability of A and B is equal to the
probability of A times the probability of B, that is, P(AB)
= P(A) · P(B).
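The second-order set of the example can be built explicitly and the theorem checked by counting. The sketch assumes the usual 48-card pinochle deck (two copies of each of the ranks ace through nine in every suit); the variable names are illustrative.

```python
from fractions import Fraction
from itertools import product

RANKS_ORDINARY = ["A", "K", "Q", "J", "10", "9", "8", "7", "6", "5", "4", "3", "2"]
RANKS_PINOCHLE = ["A", "K", "Q", "J", "10", "9"]
SUITS = ["spades", "hearts", "diamonds", "clubs"]

ordinary = [(rank, suit) for rank in RANKS_ORDINARY for suit in SUITS]  # 52 cards
# The two copies of each pinochle card are kept distinct, as the footnote requires.
pinochle = [(rank, suit, copy) for rank in RANKS_PINOCHLE
            for suit in SUITS for copy in (1, 2)]                       # 48 cards

pairs = list(product(ordinary, pinochle))  # every card paired with every card
ace_pairs = [pair for pair in pairs if pair[0][0] == "A" and pair[1][0] == "A"]

p_pair_of_aces = Fraction(len(ace_pairs), len(pairs))
p_ace_ordinary = Fraction(sum(card[0] == "A" for card in ordinary), len(ordinary))
p_ace_pinochle = Fraction(sum(card[0] == "A" for card in pinochle), len(pinochle))

print(len(pairs), len(ace_pairs), p_pair_of_aces)
print(p_ace_ordinary * p_ace_pinochle)
```

The relative frequency found by counting the pairs agrees with the product of the two first-order probabilities.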

Consider next the case of dependent probabilities. Suppose
that, in picking pairs of cards from the two decks, the following
modification is introduced. Suppose that every time the card
picked from the ordinary playing card deck is an ace, a king, a
queen, or a jack, then, before any selection is made from the
pinochle deck, an ace, a king, and a queen from each suit are
discarded from the deck and are replaced by a jack, a ten, and a
nine of each suit. The pinochle deck would then contain the
same number of cards as before, but it would have 4 aces, 4 kings,
and 4 queens instead of 8 of each, and 12 jacks, 12 tens, and
12 nines instead of 8 of each.
After the pinochle deck has been modified in this way, let a card
be selected from it and combined with the ace, king, queen, or
jack picked from the ordinary deck. If the card picked from the
ordinary deck is not an ace, a king, a queen, or a jack, let no
modification be made in the pinochle deck. The effect of this
modification in the method of forming pairs of cards is to make
the attributes of the second set dependent on those of the first.
For the set of cards of the pinochle deck associated with an ace,
a king, a queen, or a jack of the ordinary deck is different from
the set of cards of the pinochle deck associated with other cards
of the ordinary deck. The probability of a given type of card
from the pinochle pack now depends on the card from the
ordinary deck; no longer does P(B/A) = P(B).

The number of different pairs of cards that can be made in the
way just described is 52 × 48 as before, for the first card can still
be selected in 52 ways and the second in 48 ways. The number
of pairs of cards consisting of two aces, however, is now 4 × 4
instead of 4 × 8 as in the previous example. Hence the probability
of a pair of aces among the whole set of pairs is

\[ \frac{4 \times 4}{52 \times 48} = \frac{1}{156} \]

It will be noted, however, that this probability of a pair of aces
is the product of the probability of an ace in the ordinary deck
(that is, 1/13) times the probability of an ace in the modified
pinochle deck (that is, 1/12). In other words, the probability of a
pair of aces is the probability of an ace in the ordinary deck
times the probability of an ace in the pinochle deck given the
selection of an ace from the ordinary deck.
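The dependent case may be sketched by letting the composition of the pinochle deck depend on the first card. The function names and the dictionary layout are illustrative assumptions; the card counts are those described in the text.

```python
from fractions import Fraction

HONOURS = {"A", "K", "Q", "J"}  # first cards that trigger the modification


def pinochle_counts(first_rank):
    """Number of pinochle cards of each rank, given the rank of the first card."""
    if first_rank in HONOURS:
        return {"A": 4, "K": 4, "Q": 4, "J": 12, "10": 12, "9": 12}
    return {rank: 8 for rank in ("A", "K", "Q", "J", "10", "9")}


def p_second(rank, first_rank):
    """P(B/A): probability of a pinochle card of this rank, given the first card."""
    counts = pinochle_counts(first_rank)
    return Fraction(counts[rank], sum(counts.values()))


p_first_ace = Fraction(4, 52)
p_ace_given_ace = p_second("A", "A")  # modified deck
p_ace_given_two = p_second("A", "2")  # unmodified deck
p_pair_of_aces = p_first_ace * p_ace_given_ace

print(p_ace_given_ace, p_ace_given_two, p_pair_of_aces)
```

The probability of an ace from the pinochle deck now differs according to the first card drawn, and the joint probability is formed with the conditional value.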

The multiplication rule for dependent attributes may thus be
stated as follows: If members of probability set II are paired
with members of probability set I in such a way that the probability
of an attribute in the subset of members of set II paired
with any given member of set I varies with the attribute
of that member of set I, then the probability of the pair AB in
the whole set of paired values is equal to the product of the
probability of A in set I times the probability of B in that subset
of set II associated with the given attribute A. Symbolically,
the multiplication rule for dependent attributes is

P(AB) = P(A) · P(B/A)

that is, the probability of AB equals the probability of A times
the probability of B given A. Since, when P(B) is independent
of A, P(B/A) = P(B), the multiplication rule for independent
probabilities is a special case of the general formula

P(AB) = P(A) · P(B/A)

The multiplication theorem for both dependent and independent
probabilities is valid for infinite sets as well as finite sets.

The significance of independence and dependence may be
illustrated by a case pertaining to real life. Suppose that one
probability set consists of American fathers of the white race
and the other probability set consists of their eldest sons. Suppose
the attributes distinguished in each set are dying from
cancer, dying from heart disease, dying from tuberculosis, and
dying from other causes. If, now, the probability of a son dying
from tuberculosis, say, is greater, among those sons whose fathers
died of tuberculosis, than among the whole group of sons, then
the probability of a son dying from a certain cause is not independent
of the cause of death of his father. If, on the other
hand, the probability of a son dying from any particular cause
is the same for the sons whose fathers died from cancer, the
sons whose fathers died from heart disease, the sons whose fathers
died from tuberculosis, and the sons whose fathers died from
other causes, i.e., if the probability of a son dying from any
particular cause is the same, whatever the cause of the father’s
death, then the probability of a son’s death is independent of the
cause of death of his father. For example, a case of dependence
would be the following:


                                     Probability of death of eldest son from
Eldest sons whose fathers died of    Cancer    Heart disease    Tuberculosis    Other causes

Cancer                               …         0.102            0.030           0.558
Heart disease                        …         0.151            0.041           0.590
Tuberculosis                         …         0.118            0.093           0.569
Other causes                         …         0.112            0.042           0.631
All sons                             …         0.120            0.046           0.606


A case of independence would be that in which the figures
in every row of every column were the same and these in turn
were the same as the figures for “All sons.” If the probability
of a father dying from cancer was 0.228, from heart disease
0.120, from tuberculosis 0.046, and from other causes 0.606,
then, in the case of dependence represented by the above table,
the probability of both a father and an eldest son dying from
heart disease would be (0.120)(0.151) = 0.018. In the case of
independence, on the other hand, the probability of both a
father and an eldest son dying of heart disease would be

(0.120)(0.120) = 0.014.
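The two calculations can be set side by side in a short sketch. The assignment of the tabulated figures to rows and columns is an assumption inferred from the text (the heart-disease row is identified by the value 0.151 used above); only the three columns whose figures are quoted in the table are entered, and the dictionary keys are illustrative names.

```python
# Probability of each cause of death for fathers, as given in the text.
p_father = {"cancer": 0.228, "heart": 0.120, "tb": 0.046, "other": 0.606}

# P(son's cause / father's cause); row assignment is an assumption (see lead-in).
p_son_given_father = {
    "cancer": {"heart": 0.102, "tb": 0.030, "other": 0.558},
    "heart": {"heart": 0.151, "tb": 0.041, "other": 0.590},
    "tb": {"heart": 0.118, "tb": 0.093, "other": 0.569},
    "other": {"heart": 0.112, "tb": 0.042, "other": 0.631},
}
p_son_all = {"heart": 0.120, "tb": 0.046, "other": 0.606}  # the "All sons" row


def joint(father_cause, son_cause, independent=False):
    """P(AB) = P(A) * P(B/A); under independence P(B/A) is replaced by P(B)."""
    conditional = p_son_all if independent else p_son_given_father[father_cause]
    return p_father[father_cause] * conditional[son_cause]


dependent_case = joint("heart", "heart")
independent_case = joint("heart", "heart", independent=True)

print(round(dependent_case, 3), round(independent_case, 3))
```

Under dependence the conditional figure 0.151 enters the product; under independence it is replaced by the figure for all sons, 0.120.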


Illustrations. The following examples will help to illustrate
the use of the addition and multiplication theorems in the
calculation of desired probabilities. Some will also serve to illustrate
the use of the normal probability curve.

Examples Involving Discrete Distributions. Suppose that a
gambling game consists of the random tossing of five coins.
You agree to pay your opponent a predetermined sum of money
whenever all five coins turn up heads; he agrees to pay you a
predetermined sum whenever any other result occurs. The question
is: What should the odds be to make the game a fair one?
The answer is obtained as follows:

Assume that the character of the coins and the method of 
tossing are such as to cause each coin to tend to turn up heads 
and tails in equal proportion. In a large number of tossings, 
therefore, the probability of a head on each coin may be taken 
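as 1/2. Assume, also, that the method of tossing is such as to 
make the tosses of each coin independent of the others. Then 
the probability of heads on all five coins is 


$$\left(\tfrac{1}{2}\right)\left(\tfrac{1}{2}\right)\left(\tfrac{1}{2}\right)\left(\tfrac{1}{2}\right)\left(\tfrac{1}{2}\right) = \left(\tfrac{1}{2}\right)^{5} = \tfrac{1}{32}$$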


Accordingly, the fair odds are 31 to 1. That is, the game will 
be fair if you agree to pay your opponent $31 every time five 
heads occur and he agrees to pay you $1 every time some other 
result occurs. Of course, in an actual game the assumption 
regarding the character of the coins and the method of tossing 
would have to be checked by examination of the coins and by 
trial tossings. This is an illustration of the multiplication 
theorem for independent probabilities. 
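
The arithmetic of this game may be checked by direct enumeration. The following is a minimal Python sketch, assuming ideal coins exactly as the text does; the helper name `fair_odds` is an illustrative choice and does not come from the text.

```python
from fractions import Fraction
from itertools import product

def fair_odds(p_event):
    """Fair odds against an event of probability p_event, returned as (a, b) for 'a to b'."""
    odds = (1 - p_event) / p_event
    return odds.numerator, odds.denominator

# All 2**5 equally likely results of tossing five independent, balanced coins.
outcomes = list(product("HT", repeat=5))
five_heads = [o for o in outcomes if all(face == "H" for face in o)]
p_five_heads = Fraction(len(five_heads), len(outcomes))

print(p_five_heads)             # 1/32
print(fair_odds(p_five_heads))  # (31, 1)
```

Exact fractions are used so that the result agrees with the hand calculation without rounding.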

Another gambling game consists of the throwing of two dice. 
You agree to pay your opponent a predetermined sum when- 
ever a combination totaling 7 appears, and he agrees to pay 
you a predetermined sum whenever another result appears. 
Again the problem is to determine fair odds. This may be done 
by a combined application of the multiplication and addition 
theorems. 

Assume again that the dice are of such a character and are so 
thrown that all faces tend to turn up in equal proportions. The 
probability of any given result for each die is therefore 1/6. The 
six possible combinations that add up to 7 are (1,6), (2,5), (3,4), 
(4,3), (5,2), and (6,1). If it is assumed that the dice are thrown 
so as to give independent results, then, by the multiplication 
theorem for independent probabilities, the probability of each 
one of the above combinations is (1/6)(1/6) = 1/36. Any one of the 
combinations, however, will yield a total of 7. The probability 
of a total of 7 is therefore the probability of any one of these 
combinations, which, by the addition theorem, is 


$$\tfrac{1}{36} + \tfrac{1}{36} + \tfrac{1}{36} + \tfrac{1}{36} + \tfrac{1}{36} + \tfrac{1}{36} = \tfrac{6}{36} = \tfrac{1}{6}$$


Hence, fair odds are 5 to 1; that is, the game will be fair if you 
pay your opponent $5 every time a 7 occurs and he pays you $1 
every time some other total occurs. Again, in a real game the 
character of the dice and the method of throwing should be 
checked to see if the above assumptions are warranted. 
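
The combined use of the two theorems in the dice game can be traced step by step in a short Python sketch. It assumes perfectly balanced, independent dice, as the text does; the variable names are illustrative only.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely (first die, second die) results.
rolls = list(product(range(1, 7), repeat=2))
sevens = [r for r in rolls if sum(r) == 7]

# Multiplication theorem: each ordered pair has probability (1/6)(1/6).
p_pair = Fraction(1, 6) * Fraction(1, 6)
# Addition theorem: the six pairs are mutually exclusive, so their probabilities add.
p_seven = len(sevens) * p_pair

odds_against = (1 - p_seven) / p_seven
print(sevens)        # [(1, 6), (2, 5), (3, 4), (4, 3), (5, 2), (6, 1)]
print(p_seven)       # 1/6
print(odds_against)  # 5
```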
Consider still a third gambling game. Suppose that two cards 
are drawn at random from a well-shuffled pack, the suit is noted, 
the cards are returned to the pack, the latter is shuffled, and 
the whole operation is repeated. Each time the cards are all 
spades you agree to pay your opponent a predetermined sum; 
if they are otherwise, he pays you. What are fair odds? 
Assume as in the other games that the method of drawing 
cards and the method of shuffling are such that all cards tend 
to be drawn in equal proportion. The probability of a spade 
among the first cards drawn will thus be 13/52, assuming the usual 
deck of 13 spades, 13 hearts, 13 diamonds, and 13 clubs. If 
the first card drawn is a spade, the remainder of the deck con- 
tains 12 spades and 13 of each of the other suits. If in a large 
set of drawings each of these remaining cards tends to turn up 
in the same proportion as every other card, then the probability 
of a spade among the second cards drawn will be 12/51. This is 
the probability of a spade on the second draw, assuming a spade 
on the first draw. Then, according to the multiplication theorem 
for dependent probabilities, the probability of a spade on both 
the first and second draws is (13/52)(12/51) = 1/17. The odds will be 
fair, therefore, if you pay $48 every time two spades are drawn 
and your opponent pays $3 every time any other combination 
is drawn. Again the assumptions would have to be checked in a 
real game. 
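
The multiplication theorem for dependent probabilities can be checked against a direct count of ordered two-card draws. The sketch below is a minimal Python illustration, assuming a standard 52-card deck and equally likely draws; the deck encoding is an arbitrary choice made for the example.

```python
from fractions import Fraction
from itertools import permutations

# Multiplication theorem for dependent probabilities.
p_first_spade = Fraction(13, 52)
p_second_spade_given_first = Fraction(12, 51)
p_two_spades = p_first_spade * p_second_spade_given_first

# Cross-check: count ordered draws of two distinct cards from the deck.
deck = [(rank, suit) for suit in "SHDC" for rank in range(13)]
draws = list(permutations(deck, 2))
both_spades = sum(1 for a, b in draws if a[1] == "S" and b[1] == "S")
p_by_counting = Fraction(both_spades, len(draws))

odds_against = (1 - p_two_spades) / p_two_spades
print(p_two_spades, p_by_counting)  # 1/17 1/17
print(odds_against)                 # 16, the same ratio as $48 against $3
```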

Examples Involving Continuous Distributions. All the exam- 
ples so far have been concerned with discrete probability distribu- 
tions. Much of the practical work in statistics, however, is 
concerned with continuous distributions. As the first example of 
this kind, consider the distribution of heights of eighteen-year-old 
boys. The fitting of a normal curve to the heights of 300 
eighteen-year-old Princeton freshmen¹ suggests that in general 
the forces of nature are such as to cause a “normal” distribution 
of heights. If this is assumed to be the case, then the normal 
curve can be employed to calculate the probability of an eighteen- 
year-old boy having a height lying within any given range. 
This is done as follows: 

As indicated above,² if the distribution of probability follows 
the normal law, then the probability of an attribute ranging from 
x/σ to x/σ + d(x/σ) is given by the formula 


$$\frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}(x/\sigma)^2}\, d(x/\sigma)$$


where x/σ represents a deviation of the attribute from the mean 
attribute measured in σ units, σ is the standard deviation of the 
distribution, and d(x/σ) is an infinitesimally small range. This 
represents approximately the area under the curve for the infini- 
tesimal range x/σ to x/σ + d(x/σ). A finite range running, say, 
from x₁/σ to x₂/σ, can be conceived as made up of a number of infini- 
tesimal ranges of size d(x/σ); and the probability of an attribute 
ranging from x₁/σ to x₂/σ is (by the addition theorem) merely the 
sum of the probabilities for each of these infinitesimal ranges, viz., 


$$\sum_{x_1/\sigma}^{x_2/\sigma} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}(x/\sigma)^2}\, d(x/\sigma)$$


or, in the notation of the integral calculus, 


$$\int_{x_1/\sigma}^{x_2/\sigma} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}(x/\sigma)^2}\, d(x/\sigma)$$

1 See pp. 295-306 and especially Fig. 101. 
2 See pp. 264-267. 

In other words, the probability of an attribute ranging from 
x₁/σ to x₂/σ is simply the area under the curve for this range. 
This is graphically shown in Fig. 94 and is a direct result of the 
addition theorem. 

Fig. 94. Illustration of computation of probability of an x/σ lying between 
x₁/σ and x₂/σ. 

The area under the normal curve for any given range might, as 
indicated, be found by evaluating the “integral” 


$$\int_{x_1/\sigma}^{x_2/\sigma} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}(x/\sigma)^2}\, d(x/\sigma)$$


This is not an easy task, however, even for those who understand 
advanced mathematics. Consequently, tables have been prepared 
that give the approximate areas under the normal curve 
for certain specified ranges and that permit by simple arithmetical 
operations the calculation of areas for all other ranges. Such a 
table is Table VI of the Appendix, page 693. This gives the 
proportionate area under the positive half of the normal curve 
from the mean (x/σ = 0) to various selected points. Thus from 
the table it is seen that the proportion of the area lying under the 
normal curve from x/σ = 0 to x/σ = 0.2 is 0.07926. 

In addition, since the proportion of the area under the normal 
curve from x/σ = 0 (the mean) to infinity is 0.50000, the pro- 
portion of area under the curve from any selected point to infinity 
can be readily calculated. Thus the proportion of the area under 
the curve from x/σ = 0.2 to infinity is 0.42074 (i.e., 0.50000 - 
0.07926), the proportion of area from x/σ = 1.96 to infinity is 
0.02500 (i.e., 0.50000 - 0.47500), etc. Owing to the symmetry 
of the curve, the same values hold true for areas from x/σ = 0 to 
x/σ = -∞. Thus the proportion of the area for the range from 
x/σ = -0.2 to x/σ = -∞ is 0.50000 - 0.07926 = 0.42074. 

To find proportionate areas for other ranges, it is necessary 
merely to add or subtract proportionate areas given directly by 
the table. Thus, the proportion of area for the range x/σ = 0.2 
to x/σ = 0.3 is the difference between the proportionate area 
from x/σ = 0.3 to the mean, and the proportionate area from 
x/σ = 0.2 to the mean, i.e., 0.11791 - 0.07926 = 0.03865. Like- 
wise, the proportionate area under the curve for the range 
x/σ = -0.2 to x/σ = +0.3 is simply the sum of the propor- 
tionate area from x/σ = -0.2 to x/σ = 0, and the proportionate 
area from x/σ = 0 to x/σ = 0.3, i.e., 0.07926 + 0.11791 = 0.19717. 
Proportionate areas for points not given directly or 
indirectly by the table may be obtained by interpolation; usually, 
straight-line interpolation (i.e., the calculation of simple pro- 
portionate differences) gives satisfactory results. 
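
Where a printed table is not at hand, the same areas can be computed from the error function. The following Python sketch is offered only as an illustration; the function name `area_from_mean` is an assumption of this example, and the standard identity that the area from 0 to z equals one-half of erf(z/√2) is used in place of Table VI.

```python
from math import erf, sqrt

def area_from_mean(z):
    """Proportion of the area under the normal curve between the mean (0) and z, in sigma units."""
    return 0.5 * erf(z / sqrt(2))

print(round(area_from_mean(0.2), 5))                        # 0.07926
print(round(0.5 - area_from_mean(0.2), 5))                  # 0.42074
print(round(0.5 - area_from_mean(1.96), 5))                 # 0.025
print(round(area_from_mean(0.3) - area_from_mean(0.2), 5))  # 0.03865
print(round(area_from_mean(0.2) + area_from_mean(0.3), 5))  # 0.19717
```

Each printed line corresponds to one of the additions or subtractions carried out above with the tabled values.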

To make use of Table VI in a given problem it is merely neces- 
sary to convert the original measurements into deviations from 
the mean expressed in σ units, i.e., to convert original units into 
σ units. The mean height of eighteen-year-old boys, for example 
(as estimated from the heights of eighteen-year-old Princeton 
freshmen of the class of 1943), is 70.47 inches, and the standard 
deviation of heights is 2.49 inches. Hence the probability of an 
eighteen-year-old boy being 72 to 73 inches tall is given by the area 
under the normal curve from 


$$\frac{x}{\sigma} = \frac{72 - 70.47}{2.49} = 0.61 \quad\text{to}\quad \frac{x}{\sigma} = \frac{73 - 70.47}{2.49} = 1.02$$


This, in accordance with the method outlined in the previous 
paragraph for calculating such an area, is 0.11707. Similarly, the 
probability of a boy taller than 74 inches is given by the area 
under the normal curve from x/σ = (74 - 70.47)/2.49 = 1.42 to infinity, 
which the table shows to be 0.50000 - 0.42220 = 0.07780. 
Again, the probability that two boys picked at random should both be 
taller than 74 inches is the product of the two individual proba- 
bilities (the multiplication theorem for independent proba- 
bilities), or 


(0.07780)(0.07780) = 0.00605 


Table VI thus readily facilitates the calculation of probabilities 
whenever the primary distribution or distributions follow the 
normal law. 
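
The height calculations can be reproduced in a few lines of Python. This is a sketch under the text's own assumptions (a normal distribution with mean 70.47 inches and standard deviation 2.49 inches); the helper names are illustrative, and deviations are rounded to two places in order to mimic the use of a printed table.

```python
from math import erf, sqrt

MEAN, SD = 70.47, 2.49  # inches, as quoted in the text

def area_from_mean(z):
    """Area under the normal curve between the mean and z (sigma units)."""
    return 0.5 * erf(z / sqrt(2))

def sigma_units(height):
    """Deviation from the mean in sigma units, rounded to two places as in a table lookup."""
    return round((height - MEAN) / SD, 2)

z72, z73, z74 = sigma_units(72), sigma_units(73), sigma_units(74)
p_72_to_73 = area_from_mean(z73) - area_from_mean(z72)
p_over_74 = 0.5 - area_from_mean(z74)
p_two_over_74 = p_over_74 ** 2  # multiplication theorem, independent picks

print(z72, z73, z74)            # 0.61 1.02 1.42
print(f"{p_72_to_73:.5f}")      # 0.11707
print(f"{p_over_74:.5f}")       # 0.07780
print(f"{p_two_over_74:.5f}")   # 0.00605
```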


CHAPTER XI 


SYMMETRICAL BINOMIAL DISTRIBUTION AND THE 
NORMAL CURVE 


INTRODUCTION 


The preceding chapters have been concerned with probability 
and the probability calculus. These were discussed for the 
purpose of providing tools for subsequent analysis. In this 
chapter the tools will be employed in developing a theoretical 
explanation of the normal frequency curve. The line of attack 
will be as follows. 

The argument will begin with an abstract study of a simple 
problem in combinatorial analysis. The basic data will be 10 
coins, each of which has two sides. These sides will be marked 
with a head or a tail, and each coin will have one head and one 
tail. 

The combinatorial problem will be the determination of the 
relative frequencies or probabilities of various types of combi- 
nations in the whole set of combinations that might be made 
from various arrangements of the given set of coins. Thus 
the theoretical problem will be the determination of the relative 
frequencies or probabilities of combinations having 0, 1, 2, . . . , 
10 heads in the whole set of combinations that might be con- 
structed from various arrangements of the 10 coins. 

In the terminology of probability, this combinatorial problem 
consists of the derivation of a certain second-order probability 
set from the elementary probability set. To put this in another 
way, the problem is to find the type of frequency or probability 
distribution that results from the combination of certain elemen- 
tary frequency or probability distributions. Attention will in 
particular center upon the form of the derived frequency or 
probability distribution. Exact and approximate formulas will 
be determined for this distribution. 

The purely theoretical part of the theory of the normal curve 
will thus be a set of problems in the probability calculus. What 
is ultimately desired, however, is the explanation that this 
distribution affords of some of the frequency distributions that 
appear in real life, such as the frequency distributions of the 
heights of adult white males, the frequency distribution of 
samples from a given population, and the like. This explana- 
tion will be undertaken after the completion of the combinatorial 
analysis. 


SYMMETRICAL BINOMIAL DISTRIBUTION 


Derivation. As already suggested, the discussion of the theory 
of the normal frequency curve will begin with the analysis of a 
simple problem involving 10 coins. Each coin, it will be assumed, 
has two sides, one of which is a head, the other a tail. Since the 
probability of an object has been defined as its relative frequency 
in the set of objects to which it belongs, it may be said that for 
each coin the probability of a side being a head is 1/2 and the 
probability of its being a tail is also 1/2. The problem to be 
considered is this: If the 10 coins are combined in all possible 
ways, the selection of a head or a tail for any one coin being 
independent of the selection for other coins, what are the various 
types of combinations of heads and tails that will be produced 
and what will be the probability of each type in the set of all 
possible combinations? This is a straightforward problem in the 
theory of combinations and may be solved as follows. 

To facilitate the analysis, let the 10 coins be distinguished 
by the letters A, B, C, D, E, F, G, H, I, and J. A combination 
having 0 heads, for example, will be represented as follows, 

A  B  C  D  E  F  G  H  I  J 
T  T  T  T  T  T  T  T  T  T 

a combination having 4 heads as follows, 

A  B  C  D  E  F  G  H  I  J 
H  T  T  H  H  T  T  T  H  T 

etc. 
Consider first the combination having 0 heads. Since the 
probability of a tail on each coin is 1/2, the probability of 0 heads 
is (1/2)¹⁰. For the probability of A being a tail is 1/2, and the same 
is true for B, C, D, E, F, G, H, I, and J. Furthermore, since the 
probability of a tail for any one coin is independent of what 
the other coins are, the probability of the above result is, by 
the multiplication theorem, the product of the 10 independent 
probabilities, or (1/2)(1/2)(1/2)(1/2)(1/2)(1/2)(1/2)(1/2)(1/2)(1/2) = (1/2)¹⁰. Finally, 
it is to be noted that this result can be obtained in only one way. 
Hence it is to be concluded that the probability of 0 heads is 
1/2¹⁰ = 1/1,024. 
Consider next the following combination: 

A  B  C  D  E  F  G  H  I  J 
H  T  T  T  T  T  T  T  T  T 

This is a case of 1 head. Since the probability of A being 
a head is 1/2 and the probability of each of the other coins being a 
tail is also 1/2 and since each of these results is independent of the 
others, it follows that the probability of this particular com- 
bination of heads and tails is again (1/2)¹⁰. But there are also 
other combinations having only 1 head. Such are 

A  B  C  D  E  F  G  H  I  J 
T  H  T  T  T  T  T  T  T  T 
T  T  H  T  T  T  T  T  T  T 
T  T  T  T  T  T  T  T  T  H 

In fact, it is readily seen that there are 10 combinations 
altogether in each of which a different coin is the one being a 
head. The probability, therefore, of any one of these 10 com- 
binations, i.e., the probability of a head on some one and only 
one of the 10 coins is, by the addition theorem, 10(1/2)¹⁰ = 10/1,024. 

Consider now the combination 

A  B  C  D  E  F  G  H  I  J 
H  H  T  T  T  T  T  T  T  T 

This is a case of 2 heads. Since the probability of A being a 
head is 1/2, the probability of B being a head is 1/2, and the prob- 
ability of each of the other coins being a tail is likewise 1/2; and 
since each of these results is independent of all the others, it 
follows once more that the probability of this particular com- 
bination is (1/2)¹⁰. 

But, as previously, this is not the only combination having 
2 heads. The reader himself will be able to write down a number 
of other combinations in which only 2 heads appear. The 
question is how many different combinations of the 10 coins 
have 2 and only 2 heads? This is answered by the theory of 
permutations and combinations outlined in Chap. X.¹ Thus 
the number of different combinations of 10 coins taken 2 at a 
time is 


$$\frac{10!}{2!\,8!} = 45$$


[Cf. Eq. (3), page 234.] There being, therefore, 45 different 
combinations, each of which has a probability of (1/2)¹⁰, it follows 
that the probability of any one of them is 


$$45\left(\tfrac{1}{2}\right)^{10} = \frac{45}{1{,}024}$$
The probability of other possible combinations is determined 
in a similar manner. In general, the probability of N₁ heads is 


$$\frac{10!}{N_1!\,(10 - N_1)!}\left(\tfrac{1}{2}\right)^{10}$$


Thus the probability of 3 heads and 7 tails is 


$$\frac{10!}{3!\,7!}\left(\tfrac{1}{2}\right)^{10} = 120\left(\tfrac{1}{2}\right)^{10}$$


The probability of 6 heads and 4 tails is 


$$\frac{10!}{6!\,4!}\left(\tfrac{1}{2}\right)^{10} = 210\left(\tfrac{1}{2}\right)^{10}$$


etc. The results obtained by use of this formula may be tabu- 
lated as follows: 


TABLE 19. PROBABILITIES OF VARIOUS COMBINATIONS AMONG ALL POSSIBLE 
COMBINATIONS OF 10 COINS 


Combinations Having Probability 
0 head 1/1,024 = 0.00098 
1 head 10/1,024 = 0.00977 
2 heads 45/1,024 = 0.04394 
3 heads 120/1,024 = 0.11719 
4 heads 210/1,024 = 0.20508 
5 heads 252/1,024 = 0.24609 
6 heads 210/1,024 = 0.20508 
7 heads 120/1,024 = 0.11719 
8 heads 45/1,024 = 0.04394 
9 heads 10/1,024 = 0.00977 
10 heads 1/1,024 = 0.00098 


1 See pp. 232-234. 

It will be noted that the series of probabilities of 0, 1, 2, . . . , 
10 heads may be obtained by the expansion of (1/2 + 1/2)¹⁰. This 
distribution of probability is consequently called a “binomial” 
distribution.¹ If N coins had been used instead of 10, the 
probabilities of the distribution would have been given by the 
terms of the expansion of (1/2 + 1/2)^N. Thus the probability of 
a combination having N₁ heads among all possible combinations 
of N coins is² 


$$P(N_1) = \frac{N!}{N_1!\,(N - N_1)!}\left(\tfrac{1}{2}\right)^{N}$$


or, if N₂ is set equal to N - N₁, 


$$P(N_1) = \frac{N!}{N_1!\,N_2!}\left(\tfrac{1}{2}\right)^{N} \tag{1}$$


This is the general formula for a symmetrical binomial dis- 
tribution. 
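
The general formula lends itself to direct computation. The following Python sketch regenerates the entries of Table 19 from the formula for P(N₁); the function name `p_heads` is an assumption made for the example, and exact fractions are used so that the probabilities can be verified to sum to unity.

```python
from fractions import Fraction
from math import comb

def p_heads(n1, n=10):
    """Probability of exactly n1 heads among n balanced coins: n!/(n1!(n-n1)!) * (1/2)**n."""
    return Fraction(comb(n, n1), 2 ** n)

table_19 = {n1: p_heads(n1) for n1 in range(11)}
for n1, p in table_19.items():
    print(f"{n1:2d} heads  {comb(10, n1):3d}/1,024 = {float(p):.5f}")

# The terms of the expansion of (1/2 + 1/2)**10 must add to 1.
total = sum(table_19.values())
print(total)  # 1
```

For 2 and 8 heads the loop prints 0.04395, since 45/1,024 = 0.0439453...; Table 19 shows 0.04394 for these entries.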

Character of the Symmetrical Binomial Distribution. A 
graph of the probabilities given in Table 19 is presented in Fig. 
95. It will be noted from the table and also from the figure that 
the probability of 0 heads is the same as the probability of 10 
heads, that the probability of 1 head is the same as the prob- 
ability of 9 heads, etc. In other words, the distribution of 
probabilities is symmetrical about a central point, in this case 
the point representing 5 heads. This symmetry is the reason 
for the name “symmetrical” binomial distribution. 

Mathematical analysis shows that in general the symmetrical 
binomial distribution has the following characteristics:³ 


$$\text{Mean} = \frac{N}{2}, \qquad \sigma = \sqrt{\frac{N}{4}}, \qquad \beta_1 = 0, \qquad \beta_2 = 3 - \frac{2}{N} \tag{2}$$


1 Cf. p. 284. 
2 Ibid. 
3 These formulas are derived in Smith and Duncan, Sampling Statistics, 
pp. 65-67. 

It will be sufficient to check these equations here by finding the 
mean, standard deviation, β₁, and β₂ of the distribution of 
Table 19. 


Fig. 95. Graph of a symmetrical binomial distribution. [Ordinates P(N₁) of 
the binomial distribution P(N₁) = 10!/(N₁! N₂!) (1/2)¹⁰, plotted against 
N₁ = 0, 1, 2, . . . , 10.] 


The mean of a distribution of probability, it will be recalled, 
is equal to the sum of the attributes times their probabilities.¹ 
The mean of the distribution of Table 19 is thus 


$$\begin{aligned}
& \frac{1}{1{,}024}(0) + \frac{10}{1{,}024}(1) + \frac{45}{1{,}024}(2) + \frac{120}{1{,}024}(3) + \frac{210}{1{,}024}(4) \\
& + \frac{252}{1{,}024}(5) + \frac{210}{1{,}024}(6) + \frac{120}{1{,}024}(7) + \frac{45}{1{,}024}(8) + \frac{10}{1{,}024}(9) \\
& + \frac{1}{1{,}024}(10) = 5
\end{aligned}$$


According to the formula, the mean equals N/2 = 10/2 = 5, which 
is the same value as that derived above by direct calculation. 

Similarly, the variance of a distribution of probability is equal 
to the sum of the deviations from the mean squared and multi- 
plied by their probabilities. Hence, the variance of the dis- 
tribution of Table 19 is 


$$\begin{aligned}
& \frac{1}{1{,}024}(-5)^2 + \frac{10}{1{,}024}(-4)^2 + \frac{45}{1{,}024}(-3)^2 + \frac{120}{1{,}024}(-2)^2 \\
& + \frac{210}{1{,}024}(-1)^2 + \frac{252}{1{,}024}(0)^2 + \frac{210}{1{,}024}(1)^2 + \frac{120}{1{,}024}(2)^2 \\
& + \frac{45}{1{,}024}(3)^2 + \frac{10}{1{,}024}(4)^2 + \frac{1}{1{,}024}(5)^2 = 2.5
\end{aligned}$$


This again checks with the formula, which gives σ² = N/4 = 2.5. 

1 See p. 169. 

Likewise, the third moment about the mean of a probability 
distribution is the sum of the deviations from the mean cubed 
and multiplied by their probabilities, and the fourth moment is 
the sum of the deviations from the mean raised to the fourth 
power and multiplied by their probabilities. Thus, for the dis- 
tribution of Table 19, 


$$\begin{aligned}
\mu_3 = {} & \frac{1}{1{,}024}(-5)^3 + \frac{10}{1{,}024}(-4)^3 + \frac{45}{1{,}024}(-3)^3 + \frac{120}{1{,}024}(-2)^3 \\
& + \frac{210}{1{,}024}(-1)^3 + \frac{252}{1{,}024}(0)^3 + \frac{210}{1{,}024}(1)^3 + \frac{120}{1{,}024}(2)^3 \\
& + \frac{45}{1{,}024}(3)^3 + \frac{10}{1{,}024}(4)^3 + \frac{1}{1{,}024}(5)^3 = 0
\end{aligned}$$


and 


$$\begin{aligned}
\mu_4 = {} & \frac{1}{1{,}024}(-5)^4 + \frac{10}{1{,}024}(-4)^4 + \frac{45}{1{,}024}(-3)^4 + \frac{120}{1{,}024}(-2)^4 \\
& + \frac{210}{1{,}024}(-1)^4 + \frac{252}{1{,}024}(0)^4 + \frac{210}{1{,}024}(1)^4 + \frac{120}{1{,}024}(2)^4 \\
& + \frac{45}{1{,}024}(3)^4 + \frac{10}{1{,}024}(4)^4 + \frac{1}{1{,}024}(5)^4 = 17.5
\end{aligned}$$


Since, by definition, β₁ = μ₃²/μ₂³ and β₂ = μ₄/μ₂², it follows that for 
this distribution β₁ is zero and β₂ = 17.5/(2.5)² = 2.8, which are the 
values again given by the general formulas. These formulas are 
valid for all symmetrical binomial distributions. 
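
The four checks just made by hand can be repeated mechanically. The sketch below, in Python, computes the mean and the second, third, and fourth moments about the mean for the distribution of Table 19, and from them β₁ and β₂ as defined above. The function name `central_moment` is illustrative; exact fractions are used to avoid rounding.

```python
from fractions import Fraction
from math import comb

N = 10
probs = {n1: Fraction(comb(N, n1), 2 ** N) for n1 in range(N + 1)}

# Mean: sum of the attributes times their probabilities.
mean = sum(n1 * p for n1, p in probs.items())

def central_moment(k):
    """k-th moment about the mean: sum of (deviation ** k) * probability."""
    return sum((n1 - mean) ** k * p for n1, p in probs.items())

mu2, mu3, mu4 = central_moment(2), central_moment(3), central_moment(4)
beta1 = mu3 ** 2 / mu2 ** 3
beta2 = mu4 / mu2 ** 2

print(mean, mu2, mu3, mu4)  # 5 5/2 0 35/2
print(beta1, float(beta2))  # 0 2.8
```

Changing `N` to another value permits the same check to be made for any symmetrical binomial distribution.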

The Normal Curve. If 40 instead of 10 coins were involved, 
the distribution of probability would be considerably more 
spread out than that of Table 19. This is readily seen from 
Fig. 96. In general, the formula σ = √(N/4) indicates that the 
dispersion of the distribution increases in proportion to √N. 
If the horizontal scale is reduced, however, and the vertical scale 
enlarged, in the same proportion that the dispersion of the dis- 
tribution is increased, then the effect of increasing N is to bring 
the ordinates of the distribution closer together and to raise them 
to the height of the original distribution. Under these condi- 
tions the tops of the ordinates tend to sketch out a smooth curve 
as N is increased. This is indicated in Fig. 97. It can be shown 
that the limit that the symmetrical binomial distribution 
approaches as N is increased, while at the same time the scales 
are adjusted in proportion to √N, is the normal curve 


$$Y = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{x^2}{2\sigma^2}}$$


Fig. 96. Graphs of two symmetrical binomial distributions, one for N = 10, 
the other for N = 40. 

That the symmetrical binomial distribution approaches the 
normal curve as a limit can be definitely proved by rigorous 
mathematical analysis.¹ Certain general considerations, however, 
suggest this same conclusion. 

1. The distributions of Figs. 95 to 97 have a shape similar to 
that of the normal curve; and if a normal curve with the same 
mean as any one of these distributions and the same standard 
deviation is graphed together with that distribution, the curve 
is seen to be a good “fit.” This is shown² in Fig. 98. 

2. Equations (2)* show that β₁ = 0 for the symmetrical 
binomial distribution and that β₂ approaches the value 3 as N 
is increased. These are also the values of β₁ and β₂ for the 
normal curve. 

Fig. 97. Illustration of effect of scale adjustments on a symmetrical binomial 
distribution. 

Fig. 98. Comparison of a symmetrical binomial distribution with the normal 
curve. 

1 This is demonstrated in Smith and Duncan, Sampling Statistics, pp. 68-74. 

2 The binomial distribution is a discrete distribution, and its probabilities 
are correctly represented by a series of ordinates as in Figs. 96 and 97. It 
is the ordinates of the normal curve of Fig. 98 at 1/σ, 2/σ, etc., and not 
sections of the curve area that approximate the binomial ordinates at these 
points. As pointed out, however, in Smith and Duncan, Sampling Statistics, 
p. 74, it is possible to represent any symmetrical binomial distribution by a 
histogram whose area is approximated by that of a normal curve. In this 
way the area tables of the normal curve can be used to approximate a series 
of binomial probabilities. 

3. If a graph is made of the symmetrical binomial distribution 
in the form of a frequency polygon, the relative slope of any side 
of this polygon at its mid-point is the same as the relative slope 
of a normal curve at that point. Figure 99 shows, for example, 
that for N = 10 the ordinate of the symmetrical binomial at 
N₁ = 6 is equal to 210/1,024 and the ordinate at N₁ = 7 is 
120/1,024. The mid-point between 6 and 7 is 6.5, and the 
ordinate of the polygon at that point is 


$$\frac{1}{2}\left(\frac{210}{1{,}024} + \frac{120}{1{,}024}\right) = \frac{165}{1{,}024}$$


The absolute slope of the polygon at this mid-point is given by 
the ratio of the difference between the ordinates at 7 and 6 (that 
is, 120/1,024 - 210/1,024 = -90/1,024) to the distance between the 
abscissa points 6 and 7 (that is, 7 - 6 = 1); and the relative slope at the 
mid-point is given by the ratio of the absolute slope to the ordi- 
nate at that point. Thus, the relative slope of the polygon of 
Fig. 99 at the mid-point 6.5 is 


$$\frac{-90/1{,}024}{165/1{,}024} = -\frac{90}{165} = -\frac{6}{11}$$


In general, the relative slope of a symmetrical binomial distribu- 
tion at any mid-point N₁ + 1/2 is equal¹ to 


$$-\frac{2\left(N_1 + \frac{1}{2} - \frac{N}{2}\right)}{\frac{N + 1}{2}}$$


If x is set equal to N₁ + 1/2 - N/2, that is, the deviation of the 
mid-point from the mean N/2, and if 2k² is set equal to (N + 1)/2, 
this expression for the relative slope at point x becomes -2x/2k², or 
-x/k². But it can be shown¹ by the differential calculus that the 
relative slope of the normal curve at any point x is -x/σ², where 
x = X - X̄. Hence the relative slope of the symmetrical 
binomial distribution at any mid-point is the same as that of a 
normal curve whose standard deviation is equal to k = ½√(N + 1), 
which, if N is large, is practically the same as the standard 
deviation of the symmetrical binomial distribution.² 

* Page 283. 
1 See Smith and Duncan, Sampling Statistics, pp. 74-76. 
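
The equality of the two relative slopes may be verified numerically for any N and any mid-point. The Python sketch below is an illustration only: the function names are assumptions of the example, and the quantity k² = (N + 1)/4 is taken from the substitution 2k² = (N + 1)/2 described above.

```python
from fractions import Fraction
from math import comb

def ordinate(n1, n):
    """Binomial ordinate P(n1) for n balanced coins."""
    return Fraction(comb(n, n1), 2 ** n)

def relative_slope_polygon(n1, n):
    """Relative slope of the frequency polygon at the mid-point n1 + 1/2."""
    rise = ordinate(n1 + 1, n) - ordinate(n1, n)  # the run between abscissas is 1
    mid_height = (ordinate(n1 + 1, n) + ordinate(n1, n)) / 2
    return rise / mid_height

def relative_slope_normal(n1, n):
    """-x / k**2, with x = n1 + 1/2 - n/2 and k**2 = (n + 1)/4."""
    x = Fraction(2 * n1 + 1 - n, 2)
    k_squared = Fraction(n + 1, 4)
    return -x / k_squared

print(relative_slope_polygon(6, 10))  # -6/11
print(relative_slope_normal(6, 10))   # -6/11
```

The first call reproduces the worked case of the mid-point 6.5 for N = 10; the second gives the relative slope of the normal curve with standard deviation k at the same point.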


CONDITIONS PRODUCING THE SYMMETRICAL BINOMIAL AND 
THE NORMAL CURVE IN REAL LIFE 
The foregoing sections have been devoted to the derivation and 
description of a particular frequency distribution known as the 
binomial distribution. The analysis has consisted entirely of an 
application of the probability calculus, and the result is an 
abstract distribution of probability. Since the ultimate purpose 
of the analysis is an explanation of some of the frequency distri- 
butions of real life, it is desirable at this point to consider the 
question: What is the relationship between the symmetrical 
binomial distribution and a frequency distribution of real life? 
Consider first the following hypothetical experiment: Suppose 
that the 10 coins referred to in the theoretical discussion are 
tossed a large number of times and the relative frequencies with 
which they come up 0, 1, 2,..., 10 heads are computed. 
What will be the results? Actually, no precise prediction can be 
made. Intuition suggests, however, that, if the coins are sym- 
metrical and are tossed in an unbiased fashion, the relative 
frequencies with which the combinations 0, 1, 2,...,10 heads 
will appear will be close to the probabilities of these combinations 
among the whole set of combinations that could be made from 
10 coins. For if coins are tossed at random, it is to be expected 
that a head will appear on any one coin as often as a tail. The 
randomness also ensures that the appearance of a head on one 
coin will be independent of the appearance of a head or a tail on 
any other coin. Under these conditions it would seem likely 
that any particular arrangement of heads and tails would occur 
just as often as any other arrangement. Therefore, the relative 
number of times 3 heads and 7 tails would appear, for example, 
would be equal to the relative number of arrangements that 
would yield 3 heads and 7 tails out of the set of all possible 
arrangements. This is the relative frequency of the binomial 
distribution. Intuition thus suggests that the results of random 
coin tossing will be closely approximated by the binomial fre- 
quencies. Actual experiments lend support to this argument, 
so that it would seem possible to predict the results of a large 
number of tossings by the use of the probability calculus. This 
is merely an application of the law of large numbers. 

1 Ibid. 
2 See Eqs. (2), p. 283. 
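
The hypothetical tossing experiment can be imitated on a computer. The following Python sketch substitutes a pseudo-random number generator for physical coins, which is an assumption of the example rather than anything described in the text; the function name `simulate_tosses`, the number of trials, and the seed are likewise arbitrary.

```python
import random
from collections import Counter
from math import comb

def simulate_tosses(trials, coins=10, seed=0):
    """Toss `coins` balanced coins `trials` times; return the relative frequency of each head count."""
    rng = random.Random(seed)
    counts = Counter(sum(rng.randint(0, 1) for _ in range(coins)) for _ in range(trials))
    return {heads: counts[heads] / trials for heads in range(coins + 1)}

observed = simulate_tosses(100_000)
for heads, freq in observed.items():
    binomial = comb(10, heads) / 1024
    print(f"{heads:2d} heads  observed {freq:.4f}  binomial {binomial:.4f}")
```

With a large number of trials the observed column should lie close to the binomial column, and it should draw closer as the number of trials is increased.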

The relationship between the results of coin tossing and the 
binomial probabilities suggests even more important inferences. 
For there may be conditions in real life that are similar to those 
involved in the tossing of coins, and statistical variables produced 
by these conditions may be expected to follow the symmetrical 
binomial distribution and in special instances the normal curve. 
To illustrate the conditions that might give rise to such results 
consider the following examples: 


Example 1. Suppose that the sex of the offspring of a certain animal is 
determined by the type of the egg cell in the female that unites with the 
sperm cell of the male, and suppose that the number of egg cells in 
the female that will produce male offspring is on the average equal to the 
number of egg cells that will produce female offspring. If sperm cells unite 
with egg cells in a random manner, the chance is 1/2 of an offspring being a 
male and 1/2 of its being a female. These are essentially the same conditions 
that determine whether a symmetrical coin should turn up heads or tails 
when tossed at random. Under such conditions the frequency distribution 
of the number of males in families of a given size should theoretically follow 
the symmetrical binomial distribution. Thus families of size 5 should be 
expected to vary in sex combination as follows: 


Number        Percentage of Families Having 
of Males      Specified Number of Males 

0              1/32 = 0.03 
1              5/32 = 0.16 
2             10/32 = 0.31 
3             10/32 = 0.31 
4              5/32 = 0.16 
5              1/32 = 0.03 

A study of the sex of pigs in 116 litters of 5 pigs each showed the following: 


Number Percentage of Litters Having 
of Males Specified Number of Males 

0 0.02 
1 0.17 
2 0.35 
3 0.30 
4 0.12 
5 0.03 


The closeness of these figures to those above suggests that the theory of 
sex determination outlined above might very well be valid for pigs. 
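
The comparison of the two tables may be set out side by side in a short Python sketch. The observed proportions are those quoted above for the 116 litters; the variable names are illustrative, and the "largest gap" is merely a convenient summary chosen for this example, not a test of significance.

```python
from math import comb

LITTER_SIZE = 5
# Observed proportions of litters, as quoted in the text for the pig study.
observed = {0: 0.02, 1: 0.17, 2: 0.35, 3: 0.30, 4: 0.12, 5: 0.03}

# Symmetrical binomial proportions for families (litters) of size 5.
expected = {m: comb(LITTER_SIZE, m) / 2 ** LITTER_SIZE for m in range(LITTER_SIZE + 1)}

for m in range(LITTER_SIZE + 1):
    print(f"{m} males  expected {expected[m]:.2f}  observed {observed[m]:.2f}")

largest_gap = max(abs(expected[m] - observed[m]) for m in expected)
print(f"largest gap: {largest_gap:.3f}")
```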

Example 2. In Example 1, conditions were such as to produce a variable 
(number of males) that was discrete and integral. The present hypothetical 
example will suggest conditions which might produce a variable that was 
discrete but not integral and that was distributed in the form of a sym- 
metrical binomial. It also suggests conditions under which the variable 
might be practically continuous and distributed like a normal curve. 

Suppose that there are a large number of bags of flour, say 10,000, each 
weighing exactly 5 pounds. Suppose that an experimenter opens each bag 
in succession and adds or subtracts a certain quantity of flour to each bag 
in accordance with the following rule: Whenever he opens a bag, he also 
tosses 10 coins. For each head that appears he adds an ounce of flour to the 
bag; for each tail, he subtracts an ounce. The result of this procedure will 
be 10,000 bags of flour varying in weight from 5 pounds — 10 ounces to 
5 pounds + 10 ounces, the unit difference being 2 ounces. In accordance 
with the foregoing analysis, the distribution of the weights of these bags of 
flour would be approximately as follows: 


Weight of Bag      Relative Frequency of Occurrence 
4 lb.  6 oz.           1/1,024 
4 lb.  8 oz.          10/1,024 
4 lb. 10 oz.          45/1,024 
4 lb. 12 oz.         120/1,024 
4 lb. 14 oz.         210/1,024 
5 lb.  0 oz.         252/1,024 
5 lb.  2 oz.         210/1,024 
5 lb.  4 oz.         120/1,024 
5 lb.  6 oz.          45/1,024 
5 lb.  8 oz.          10/1,024 
5 lb. 10 oz.           1/1,024 


In other words, the distribution of weights would approximately conform 
to a symmetrical binomial distribution with a mean weight of 5 pounds and a 
standard deviation of 2√2.5 = 3.16 ounces, since each additional head 
raises the weight by 2 ounces. 
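
The distribution of bag weights, together with its mean and standard deviation, can be computed directly from the rule of the experiment. The Python sketch below works in ounces (5 pounds = 80 ounces); the variable names are assumptions of the example.

```python
from fractions import Fraction
from math import comb, sqrt

COINS = 10
BASE_OZ = 5 * 16  # 5 pounds expressed in ounces

# weight in ounces -> probability; each head adds 1 oz, each tail removes 1 oz
dist = {BASE_OZ + heads - (COINS - heads): Fraction(comb(COINS, heads), 2 ** COINS)
        for heads in range(COINS + 1)}

mean_oz = sum(w * p for w, p in dist.items())
var_oz = sum((w - mean_oz) ** 2 * p for w, p in dist.items())
sd_oz = sqrt(var_oz)

for w in dist:
    heads = (w - BASE_OZ + COINS) // 2
    print(f"{w // 16} lb. {w % 16:2d} oz.  {comb(COINS, heads):3d}/1,024")
print(mean_oz, var_oz, round(sd_oz, 2))  # 80 10 3.16
```

The variance of 10 square ounces is four times the variance of the number of heads (2.5), because successive weights differ by 2 ounces.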

This shows how a variable may be produced that is discrete but not 
integral and that is distributed in the form of a symmetrical binomial 
distribution. To produce a variable that is practically continuous, it is 
necessary to increase the number of coins from 10 to 100, say, and to reduce 
the amount of flour added or subtracted to 0.01 ounce. Differences as 
small as 0.02 ounce would thus be possible, and for all practical purposes 
the weight of a bag of flour could be said to be continuous. Under these 
conditions a graph of the distribution of weights would be practically 
continuous and, as indicated in the theoretical discussion, would have 
the form of a normal frequency curve. 
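
The flour-bag rule of Example 2 can be checked by direct enumeration. The following is a minimal sketch in Python, not part of the original text; the function name `flour_bag_distribution` and its parameters are hypothetical. It lists the exact probability of each net change in weight for 10 coins and computes the mean and standard deviation of that change.

```python
from fractions import Fraction
from math import comb, sqrt


def flour_bag_distribution(n_coins=10, step_oz=1):
    """Exact distribution of the net ounces added to one bag.

    Each head adds step_oz ounces and each tail subtracts step_oz, so
    k heads out of n_coins give a net change of (2k - n_coins) * step_oz.
    """
    total = 2 ** n_coins
    return {
        (2 * k - n_coins) * step_oz: Fraction(comb(n_coins, k), total)
        for k in range(n_coins + 1)
    }


dist = flour_bag_distribution()
mean = sum(change * p for change, p in dist.items())
variance = sum((change - mean) ** 2 * p for change, p in dist.items())

for change, p in sorted(dist.items()):
    # Express every probability over the common denominator 1,024.
    print(f"{change:+3d} oz  {p.numerator * (1024 // p.denominator):4d}/1,024")
print("mean change (oz):", mean)
print("standard deviation (oz):", round(sqrt(variance), 2))
```

Exact fractions are used so that the entries can be compared with the table digit for digit. Raising `n_coins` to 100 and lowering `step_oz` to 0.01 gives the finer-grained case described in the text.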

Example 3. Example 2 was entirely 
hypothetical. An apparatus has been 
constructed, however (see Fig. 100), that 
reproduces in somewhat different form the 
conditions of Example 2. By its use the 
results predicted in Example 2 can be 
concretely illustrated. 

Fig. 100. The Pearson-Galton apparatus for physical illustration of a binomial distribution. (Horizontal scale in centimeters, from -5 to +5.) 

The apparatus of Fig. 100 was devised 
many years ago by Sir Francis Galton and 
subsequently elaborated by Karl Pearson.¹ 
As represented in Fig. 100 it consists of a 
series of rows of wedges, each row containing 
an additional wedge and so arranged 
that its wedges come halfway between the 
wedges of the row above. If the wedges 
are placed 1 centimeter apart, then a small 
ball dropped into the top of the machine 
will have an equal chance in each row of 
being deflected 0.5 centimeter either to the 
left or the right. The apparatus of Fig. 
100 has 10 rows. The final deviation of 
the ball from the central point O will thus 
be the algebraic sum of the left (minus) 
and right (plus) deflections as it falls 
through the 10 rows. The possible range 
of this final deviation is from -5 to +5 centimeters. Since the probability 
of a plus and minus deviation of 0.5 centimeter is in each row equal to 1/2 
(similar to the probability of a head and a tail for a coin) and since there 
are 10 rows (as there were 10 coins in the previous case), the probabilities of 
final deviations of -5, -4, -3, -2, -1, 0, +1, +2, +3, +4, +5 centimeters 
will be the same as those of the binomial distribution, 


$$P(N) = \frac{10!}{N!\,(10 - N)!}\left(\frac{1}{2}\right)^{10},$$ 

which are given in Table 19, page 282. 

1 GALTON, FRANCIS, Natural Inheritance (Macmillan & Company, Ltd., 
London, 1889), p. 63; PEARSON, KARL, “Skew Variation in Homogeneous 
Material,” Philosophical Transactions of the Royal Society of London, Series 
A, Vol. 186 (1895), p. 343. Pearson’s contribution was to replace the set of 
nails used by Galton by a set of sliding wedges that could be so adjusted 
that the chances of deflection to the left and right were not equal. Figure 
100 follows the pattern of Galton’s apparatus. 
These are the theoretical probabilities of the apparatus. If a large 
number of balls are actually dropped into the machine, the exact result 
cannot be predicted. Intuition suggests, however, that the relative fre- 
quencies with which the balls will pile up in the different slots will tend to 
approximate the theoretical probabilities and this is demonstrated by actual 
experiments. Such a result is pictured in Fig. 100 by the shading of the 
slots in proportion to the binomial probabilities. 
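
The tendency of the observed frequencies to approach the theoretical probabilities can be tried out by simulation rather than with a physical machine. The sketch below is an illustration only and is not from the original text; the function `drop_balls`, the number of balls, and the random seed are all assumptions made for the example.

```python
import random
from collections import Counter
from math import comb


def drop_balls(n_balls, n_rows=10, step_cm=0.5, seed=0):
    """Simulate balls falling through n_rows rows of wedges.

    In each row a ball is deflected step_cm to the right or to the left
    with equal probability; the final deviation is the algebraic sum.
    """
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_balls):
        rights = sum(rng.random() < 0.5 for _ in range(n_rows))
        counts[(2 * rights - n_rows) * step_cm] += 1
    return counts


n_balls = 20_000
counts = drop_balls(n_balls)
for rights in range(11):
    deviation = (2 * rights - 10) * 0.5
    theoretical = comb(10, rights) / 2 ** 10
    observed = counts[deviation] / n_balls
    print(f"{deviation:+5.1f} cm  theoretical {theoretical:.4f}  observed {observed:.4f}")
```

With a different seed or a smaller number of balls the observed column will wander further from the theoretical one, which is the point made in the text: the exact result cannot be predicted, only its tendency.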

It will be noted that in this case the variable, that is, the final deviation 
of a ball from the central point O, is again discrete. Deviations of integral 
centimeters only are possible. If, however, the number of rows were 
increased from 10 to 1,000, say, and if at the same time the wedges were 
reduced to 0.01 centimeter in size and placed so that they were only 0.01 
centimeter apart (the balls would, of course, have to be correspondingly 
reduced in size), then the final deviations would vary by 0.01 centimeter 
and might be practically considered a continuous variable. The distri- 
bution of relative frequencies would in this case closely approximate a 
smooth frequency curve, which would once again be the normal curve. 

Theory of Errors. Errors in physical measurements may be broken up 
into several components. (1) The “instrumental error” may be attributed 
to the particular instrument with which the measurement is made; every 
measurement by it will contain a certain error that may be assigned to that 
instrument. (2) The “personal error” may be attributed to the particular 
person undertaking the measurement; every observation by him will be 
influenced by his “personal equation.” (3) Another component error may 
be attributed to particular external conditions, such as the temperature, 
sunlight, and wind. These errors due to the instrument, the observer, and 
specific external conditions are all “systematic errors” that can be allowed 
for. (4) A final component error is the “accidental error,” or “residual 
error,” to which no definite cause can be assigned. Such errors are the 
result of the whole host of chance forces, the same sort of forces that determine 
whether an unbiased coin comes up heads or tails. “The accidental 
error in any individual measurement may be taken to be the sum 
of a number of small accidental errors arising from different causes.”¹ Slight 
irregular changes in external conditions, such as the vibration on account of 
air currents or irregular changes in the personal equation of the observer, 
are examples of causes for accidental error of measurement. If it is possible 
to discover the law of action of any error, it is thereby removed from 
the class of accidental errors to the class of systematic errors. 

If the number of forces affecting the residual errors in any series of 
measurements is large, if each causes a very small plus or minus deviation 
from the true value, and if the probability of a plus and a minus deviation 
is for each force equal to 1/2, then, as in the case of the flour-bag experiment 
and the Pearson-Galton apparatus, the final residual errors of the series of 
measurements will tend to be distributed in accordance with the normal 
curve. The mean of this curve will be the true value (after allowance has 


1 See BRUNT, DAVID, The Combination of Observations, pp. 3-4. 
been made, of course, for the systematic errors mentioned above), and the 
standard deviation of the curve will be an index of the precision of measurement. 
This is the theory of errors.¹ It is supported by the close agreement 
between the normal curve and distributions of actual measurements. In 
fact, the normal curve is often spoken of as the “error curve” or the “Gaussian 
error curve,” after the man who was among the first to recognize the 
possibility of applying the theory of probability to the investigation of the 
errors of measurement.² 


Summary of Conditions Leading to the Symmetrical Binomial 
Distribution and the Normal Curve. The foregoing examples 
suggest that whenever the following conditions exist in real 
life, the data generated by these conditions will tend to be dis- 
tributed in the form of a symmetrical binomial distribution and, 
if certain other conditions are also present, in the form of a 
normal curve. The conditions giving rise to the symmetrical 
binomial distribution may be stated as follows: 

1. In the absence of certain “causes” of variation or in the 
event of a perfect balancing of their effects, the data assume a 
fixed central value. (The 5 pounds of the flour illustration, the 
“true value” in a series of measurements.) 

2. Deviations from this central value result from certain 
“causes” of variation, the effect of any “cause” being either to 
add a fixed quantity to the data or to subtract the same quantity. 
(To add or subtract 1 ounce of flour or to add or subtract an 
“error” of 0.5 centimeter.) 

3. A “cause” of variation tends to produce positive effects and 
negative effects in equal proportion, that is, P(+) = P(-) = 1/2. 
(The probability of a head equals the probability of a tail; the 
probability of a positive error equals the probability of a negative 
error.) 

4. The effects of all contributory causes of variation are of 
equal magnitude. (Each adds or subtracts 1 ounce of flour 
or 0.5 centimeter.) 


1 Actually this is a special case of a more general statement of the theory. 
As pointed out in Smith and Duncan, Sampling Statistics, p. 97, each force 
may cause deviations of varying size with varying probabilities and the 
final residual errors will still tend to be normally distributed provided that 
the number of forces is very large and the relative importance of each is 
about the same. 

2 For Gauss's fundamental works see Abhandlungen zur Methode der 
kleinsten Quadrate (A. Börsch and P. Simon, Berlin, 1887). 
5. The contributory causes are independent in their action. 
In other words, the contribution of a positive or negative effect 
by any causal factor is independent of the previous contributions 
of other causal factors. 

6. The total deviation of any element from its central value 
is the algebraic sum of the positive and negative contributions of 
the individual causal factors. (The total amount of flour added 
or subtracted from a bag is the sum of the ounces added for each 
head tossed minus the ounces subtracted for each tail tossed.) 

If in addition to these conditions, the following also exist, then 
the resulting distribution will tend to conform to the normal 
curve: 

7. The number of contributory causes is very large. (A 
large number of coins are tossed; the binomial machine contains 
a large number of rows.) 

8. The positive and negative contributions of each cause are 
very small. (If 0.01 ounce is added or subtracted instead of 1 
ounce; if 0.005 centimeter, instead of 0.5 centimeter.) 

It is to be noted that, so far as the normal curve is concerned, 
not all these conditions are necessary for its generation. The 
above conditions will produce it, but the normal curve may also 
occur when some of these conditions are absent.¹ It may be 
stated here that the normal curve will still be produced if conditions 
2 and 3 are relaxed so that a causal factor may affect the 
data in varying degree and with varying probabilities and also if 
condition 4 is only approximately and not exactly true.² The 
most important conditions are 6 to 8 and condition 4 in an 
approximate form. For example, in the case of the flour illus- 
tration, the resulting weights of the bags of flours would still tend 
to be normally distributed even if biased dice instead of unbiased 
coins were used and if the amount of flour added or subtracted 
varied with the result of the throw (say 0.001 ounce for the 
occurrence of a one, —0.002 ounce for the occurrence of a two, 
0.003 ounce for the occurrence of a three, —0.004 for the occur- 
rence of a four, etc.) provided that the number of dice thrown 
was very large and the amount added or subtracted per die was 
very small and of about the same order of magnitude from die to 

1 See SMITH and DUNCAN, Sampling Statistics, pp. 97-100. 

2 Under certain conditions the requirement of independence (condition 5) 
may also be relaxed. See SMITH and DUNCAN, Sampling Statistics, pp. 63-65. 

die. The normal curve is thus a more general phenomenon than 
the symmetrical binomial distribution.¹ 
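
The biased-dice variant of the flour illustration can be explored numerically. The following sketch is illustrative only and is not from the original text. The amounts for faces one to four follow the text and those for five and six continue the same pattern; the face probabilities, the number of dice per bag, the number of bags, and the seed are hypothetical choices. The shape of the simulated totals is summarized by moment-based skewness and kurtosis, which for a normal curve are 0 and 3.

```python
import random


def shape(values):
    """Return (skewness, kurtosis) computed from central moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2


def simulate_bags(n_bags=5000, n_dice=200, seed=1):
    """Net ounces added to each bag when n_dice biased dice are thrown."""
    rng = random.Random(seed)
    # Ounces added for faces one to six (pattern continued from the text).
    amounts = [0.001, -0.002, 0.003, -0.004, 0.005, -0.006]
    # Hypothetical bias: unequal probabilities for the six faces.
    weights = [0.25, 0.20, 0.20, 0.15, 0.10, 0.10]
    return [
        sum(rng.choices(amounts, weights=weights, k=n_dice))
        for _ in range(n_bags)
    ]


changes = simulate_bags()
skewness, kurtosis = shape(changes)
print(f"skewness {skewness:+.3f} (normal curve: 0)")
print(f"kurtosis {kurtosis:.3f} (normal curve: 3)")
```

Because the dice are biased, the mean change is no longer zero; it is the form of the distribution about its mean, not its location, that approaches the normal curve.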

Examples of Normal Frequency Distributions. Natural forces 
appear to generate normal frequency distributions in many fields. 
Physical measurements have already been mentioned. Figure 
101 shows the distribution of heights of 300 eighteen-year-old 
Princeton freshmen. The grades of students on examinations, 
hourly earnings of workers, the length of life of electric- 
light bulbs, the distance of baseball throws of first-year high- 
school girls are all normally distributed variables. In these 
fields and in many others, it would seem that the conditions of 
variation are those which theoretically give rise to the normal 
curve. 


Fig. 101. Normal curve fitted to heights of 300 Princeton freshmen. (Horizontal axis: height in inches, 62 to 78.) 


DETERMINATION OF NORMALITY 


Several procedures are available for determining whether the 
population from which a given set of sample data has been taken 
might reasonably be considered to conform to the normal curve. 
In general, these consist of comparing the histogram constructed 
from the sample data with a normal curve “fitted” to this histo- 
gram. The difference in the various procedures lies in the bases 


1 Mathematically the normal curve can be derived from a great variety of 
different assumptions. See, for example, CZUBER, EMANUEL, Theorie der 
Beobachtungsfehler (B. G. Teubner, Leipzig, 1891). 
of comparison. Several of the more important procedures will 
now be discussed. 

Graphic Comparison. The simplest method of determining 
whether the assumption of normality is or is not reasonable is 
to graph the histogram and normal curve together and see how 
well the curve fits. The test here is purely a subjective one, 
but in many cases when the fit is exceptionally good or excep- 
tionally bad this is probably sufficient for acceptance or rejection 
of the hypothesis. 

In making a graphic comparison of a sample histogram and a 
normal curve, it is necessary to determine what mean and what 
standard deviation should be assigned to the curve. Offhand 
the simplest procedure would appear to be the assignment to 
the curve of the mean and standard deviation of the histogram, 
for presumably these are the best estimates that may be made 
of the mean and standard deviation of the population from which 
the sample was taken. It will be recalled, however, that in the 
calculation of the mean and standard deviation of the histogram 
the data were distributed among various classes or groups and all 
the cases in any class interval were assumed to be concentrated 
at the mid-point of the interval. But the population is pre- 
sumably distributed in the form of a smooth curve, so that, in 
estimating its mean and standard deviation from that of the 
histogram, allowance must be made for the grouping of the data 
in the construction of the histogram. In any interval a smooth 
bell-shaped curve, such as the normal curve, will have more cases 
that are on the side toward the mean than on the side away from 
the mean. The assumption that all cases are concentrated at 
the mid-point of an interval will not cause any appreciable error 
in the mean calculated from grouped data, for plus and minus 
deviations will offset each other; but it will cause the standard 
deviation of the grouped data to be greater than the standard 
deviation of the smooth curve that represents the true distribu- 
tion of the data. Some adjustment should therefore be made in 
the standard deviation of the histogram before it is taken as an 
estimate of the standard deviation of the population. 

The adjustment that must be made for grouping has been 
determined by W. F. Sheppard. He has shown that under con- 
ditions that are true for a normal distribution the variance 

1 Cf. pp. 318 and 319. 
$\sigma^2$ of the smooth curve is approximately equal to the variance of 
the grouped data minus one-twelfth the square of the class interval.¹ 
In other words, if $\mu_2$ (uncorrected) is the second moment 
($= s^2$) of the grouped data about its mean, $\mu_2$ is the second 
moment of the smooth curve about its mean, and $i$ is the class interval, then 

$$\mu_2 = \mu_2(\text{uncorrected}) - \tfrac{1}{12}(i)^2$$ 

The quantity $\tfrac{1}{12}(i)^2$ is Sheppard’s correction for grouping that is 
required for estimating the standard deviation of the fitted 
normal curve. 
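
Sheppard's correction is a one-line computation. The sketch below is not from the original text; the function name `sheppard_corrected_sd` and the example figures (a grouped standard deviation of 2.5 with a class interval of 1) are hypothetical and are not the Princeton data.

```python
from math import sqrt


def sheppard_corrected_sd(grouped_variance, class_interval):
    """Standard deviation of the smooth curve after Sheppard's correction.

    The variance of the grouped data is reduced by one-twelfth of the
    square of the class interval before the square root is taken.
    """
    corrected = grouped_variance - class_interval ** 2 / 12
    if corrected <= 0:
        raise ValueError("class interval too wide for this variance")
    return sqrt(corrected)


# Hypothetical figures for illustration only.
uncorrected_sd = 2.5
corrected_sd = sheppard_corrected_sd(uncorrected_sd ** 2, class_interval=1)
print(f"uncorrected {uncorrected_sd:.3f}, corrected {corrected_sd:.3f}")
```

The correction is small when the class interval is narrow relative to the standard deviation, and it always lowers the estimate, in line with the argument that grouping inflates the standard deviation.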

In fitting a normal curve to a sample histogram, therefore, the 
mean of the curve is taken equal to the mean of the histogram 
and the variance of the curve is taken equal to the variance of the 
histogram minus $\tfrac{1}{12}(i)^2$. In plotting the curve a table of the 
ordinates of the standard normal curve may conveniently be used. 
If the histogram to which the curve is to be fitted is of the usual 
type, that is, if it consists of a series of rectangles of which the 
heights measure aggregate frequencies and if the intervals on 
which these rectangles are erected are laid off in terms of original 
X units, then ordinates of the standard normal curve can be 
taken to represent the particular normal curve desired by making 
certain simple adjustments. The ordinates of the standard 
normal curve, it will be recalled, are given for values of $x/\sigma$, that 
is, for values measured from the mean of the distribution and expressed 
in terms of standard deviation units. It will also be recalled that 
the area of the curve over any given interval measures the relative 
frequency of cases falling in this interval. To make these 
ordinates represent a normal curve with a given mean and a 
given standard deviation, they need only be plotted so that the 
ordinate for $x/\sigma = 0$ comes at the specified mean value $\bar{X}$ and 
ordinates for other values of $x/\sigma$ come at $X = \bar{X} + (x/\sigma)\sigma$. To 
put them on the same basis as the histogram, however, they must 
also all be multiplied by $Ni/\sigma$. This is because the total area 
of the histogram² is $Ni$ and that of the standard normal curve is 
1 (that is, 100 per cent), whereas the abscissa scale on which the 
histogram is plotted is $\sigma$ times the abscissa scale of the standard 
curve. This use of the ordinates of the standard normal curve 

1 Cf. Proceedings of the London Mathematical Society, Vol. 29, pp. 353-380. 

2 The area of any one rectangle is $Fi$, and the total area is therefore 
$\sum Fi = Ni$. 
may be illustrated by fitting a normal curve to the heights of 
300 Princeton freshmen. 
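
The rescaling of the standard ordinates described above can be sketched in Python. This is an illustration, not the authors' procedure: the function `fitted_ordinates` is hypothetical, and the standard ordinates are computed with the standard library's `statistics.NormalDist` rather than read from a printed table. The mean (70.47), corrected standard deviation (2.47), number of cases (300), and class interval (1 inch) are the figures quoted for the Princeton heights in Table 20.

```python
from statistics import NormalDist


def fitted_ordinates(midpoints, mean, sd, n_cases, class_interval):
    """Ordinates of the fitted normal curve at the class mid-points.

    Each row is (mid-point, x/sigma, standard ordinate, scaled ordinate),
    the scaling factor being N * i / sigma.
    """
    standard = NormalDist()  # mean 0, standard deviation 1
    scale = n_cases * class_interval / sd
    rows = []
    for x_mid in midpoints:
        z = (x_mid - mean) / sd
        ordinate = standard.pdf(z)
        rows.append((x_mid, z, ordinate, ordinate * scale))
    return rows


midpoints = [62.5 + k for k in range(16)]  # 62.5, 63.5, ..., 77.5
scale = 300 * 1 / 2.47
rows = fitted_ordinates(midpoints, mean=70.47, sd=2.47, n_cases=300, class_interval=1)

print(f"scaling factor Ni/sigma = {scale:.2f}")
for x_mid, z, ordinate, scaled in rows:
    print(f"{x_mid:5.1f}  {z:+6.2f}  {ordinate:.5f}  {scaled:6.2f}")
```

Because the ordinates are computed directly instead of being taken from a table entered at rounded values of $x/\sigma$, individual figures may differ slightly in the last decimal places from a hand calculation.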

In Table 20 the mid-points of the various class intervals into 
which the 300 heights were distributed are set down in column 
(1). In column (2) the difference between these mid-points and 
the mean of the distribution ($\bar{X} = 70.47$) is computed, and in 
column (3) this is divided by the adjusted standard deviation. 
The results are the various values of $x/\sigma$ that correspond to the 
mid-points of the various class intervals. The ordinates of the 
standard normal curve at these values of $x/\sigma$ are then computed 
from Table VII (see Appendix, page 694) and entered in column 
(4). Finally, in column (5), these standard ordinates are multiplied 
by $Ni/\sigma = (300)(1)/2.47 = 121.46$ to put them on a par with the 
sample histogram. 

$\chi^2$ Test of Goodness of Fit. Another method of comparing a 
sample histogram with a normal curve is to compare the frequencies 
given by the two, interval by interval. Whereas the 
previous method was primarily subjective in that a conclusion 
had to be reached from a mere inspection of the two graphs, comparison 
of the histogram and the curve, interval by interval, 
yields a numerical criterion of “goodness of fit.” A procedure 
that has found favor because it permits a comparison with chance 
results is to take the difference between the absolute frequencies¹ 
given by the curve and by the histogram for each interval, square 
these differences, divide each by the frequency of the curve for 
that interval, and finally sum the results. The quantity so 
calculated may be represented by $\sum \frac{(F - f)^2}{f}$, where $F$ represents 
for each class interval the frequency given by the histogram 
and $f$ the frequency given by the curve. 

Sampling theory shows² that, if this quantity is calculated for 
a large number (theoretically, an infinite number) of sample histograms 
from the same normal population, then the distribution 
of these various sample values of $\sum \frac{(F - f)^2}{f}$ will be adequately 
represented by a probability curve known as the “$\chi^2$ curve” and 
this can be used to determine the probability of a larger value of 


1 For the curve, this means the relative frequencies times N, the total 
number of cases in the sample. 

2 On the $\chi^2$ distribution see Smith and Duncan, Sampling Statistics, pp. 
111-112 and Chap. XIII. 
TABLE 20. CALCULATION OF THE ORDINATES OF THE NORMAL CURVE 
THAT FITS THE DISTRIBUTION OF HEIGHTS OF 300 PRINCETON FRESHMEN 


   (1)         (2)          (3)         (4)            (5) 
Mid-point   Deviation      x/σ       Standard      Column (4) 
    X       from mean                ordinate       × Ni/σ 
  62.5       -7.97        -3.22      0.00224         0.27 
  63.5       -6.97        -2.82      0.00748         0.91 
  64.5       -5.97        -2.42      0.02134         2.59 
  65.5       -4.97        -2.01      0.05292         6.43 
  66.5       -3.97        -1.59      0.11270        13.69 
  67.5       -2.97        -1.19      0.19652        23.87 
  68.5       -1.97        -0.80      0.28969        35.19 
  69.5       -0.97        -0.39      0.36973        44.91 
  70.5        0.03         0.01      0.39892        48.45 
  71.5        1.03         0.42      0.36526        44.36 
  72.5        2.03         0.82      0.28504        34.62 
  73.5        3.03         1.22      0.18954        23.02 
  74.5        4.03         1.63      0.10567        12.83 
  75.5        5.03         2.04      0.04980         6.05 
  76.5        6.03         2.44      0.02033         2.47 
  77.5        7.03         2.84      0.00707         0.86 


$\bar{X} = 70.47$          $\sigma$ (corrected) $= 2.47$ 


$\sum \frac{(F - f)^2}{f}$ by chance. If the probability is a large one, then 
the difference between the given sample histogram and the 
normal curve, as measured by $\sum \frac{(F - f)^2}{f}$, may reasonably be 
attributed to chance; the curve may be deemed a good fit, and 
the population from which the sample was drawn may tentatively 
be taken as normal. If the probability is very small, 
however, say less than 0.05, then the difference between the 
histogram and the curve is to be attributed to something else 
than chance, presumably to the nonnormality of the population 
from which the sample was drawn. In this case, the normal 
curve is not deemed a good fit, and the hypothesis of a normal 
population is rejected. Owing to its use of the $\chi^2$ curve, this 
second method of comparison is called the “$\chi^2$ test of goodness 
of fit.” 

The $\chi^2$ test may be illustrated, as in the previous case, by the 
distribution of heights of 300 Princeton freshmen. The numerical 
calculations involved are set forth in Table 21. In column (1) 
are put the upper limits (not the mid-points in this case) of the 
various class intervals of the histogram of heights, the first and 
last intervals being considered to run to $-\infty$ and $+\infty$, respectively. 
The deviations of these class-interval limits from the 
mean of the distribution are calculated in column (2), and their 
ratio to the “corrected” standard deviation is computed in 
column (3). In column (4) are written the areas under the 
standard normal curve from its lower limit ($-\infty$) to each of 
these class-interval limits, now measured in standard-deviation 
units. These areas are obtained from Table VI, page 693, 
of the Appendix. The figure in column (4) already gives the 
area for the first interval ($-\infty$ to 63), and the areas for the other 
intervals can be computed by taking the difference between 
the area under the curve up to one class limit and the area up 
to the next higher class limit. These differences are written in 
column (5). In order to avoid very small areas in the extreme 
intervals (and hence a distortion of the test¹), several groups at 
each end are amalgamated so that the areas for the first and last 
interval are at least equal to 5/N. 

The new arrangement is indicated in columns (1′) and (5′). 
Now it will be noted that the figures of column (5′) represent 
the relative frequencies given by the curve. To convert them 
to aggregate frequencies that are comparable with the aggregate 
frequencies of the histogram, it is necessary to multiply 
them by N (= 300), the total number of cases. This is done 
in column (6). In column (7) are written the histogram 
frequencies for the intervals of column (1′), and in the remaining 
columns the differences between the histogram and curve frequencies 
are computed, squared, and divided by the curve frequencies. 
The sum of the last column is the value of $\sum \frac{(F - f)^2}{f}$ 
desired. In the present instance this is found to be 3.867. 
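
The steps of this calculation (areas up to each class limit, differences between successive areas, amalgamation of the extreme intervals, and the final sum) can be sketched as follows. This is an illustration under stated assumptions, not a reproduction of Table 21: the function names are hypothetical, areas come from `statistics.NormalDist` rather than a printed table, and the `observed` frequencies are invented for the example and are not the Princeton data.

```python
from statistics import NormalDist


def expected_frequencies(upper_limits, mean, sd, n_cases):
    """Curve frequencies per interval; the last interval runs to +infinity."""
    standard = NormalDist()
    cumulative = [standard.cdf((u - mean) / sd) for u in upper_limits] + [1.0]
    areas = [cumulative[0]] + [b - a for a, b in zip(cumulative, cumulative[1:])]
    return [n_cases * area for area in areas]


def merge_tails(expected, observed, minimum=5.0):
    """Pool end intervals until each end has a curve frequency >= minimum."""
    expected, observed = list(expected), list(observed)
    while len(expected) > 2 and expected[0] < minimum:
        first_e, first_o = expected.pop(0), observed.pop(0)
        expected[0] += first_e
        observed[0] += first_o
    while len(expected) > 2 and expected[-1] < minimum:
        last_e, last_o = expected.pop(), observed.pop()
        expected[-1] += last_e
        observed[-1] += last_o
    return expected, observed


def chi_square(observed, expected):
    """Sum of (F - f)^2 / f over the class intervals."""
    return sum((F - f) ** 2 / f for F, f in zip(observed, expected))


upper_limits = list(range(63, 78))  # 63, 64, ..., 77 inches
expected = expected_frequencies(upper_limits, mean=70.47, sd=2.47, n_cases=300)
# Hypothetical histogram frequencies (16 intervals, 300 cases in all).
observed = [0, 1, 3, 7, 13, 24, 36, 45, 49, 43, 34, 22, 13, 6, 3, 1]

merged_expected, merged_observed = merge_tails(expected, observed)
print("intervals after amalgamation:", len(merged_expected))
print("chi-square:", round(chi_square(merged_observed, merged_expected), 3))
```

A curve frequency of 5 corresponds to the area 5/N mentioned in the text. With the mean and corrected standard deviation quoted for the heights, the amalgamation leaves 11 intervals, which agrees with the count used later in entering Table 22.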
To determine whether the value of $\sum \frac{(F - f)^2}{f} = 3.867$ 
represents a good fit or not, turn to Table 22. Here are given 
certain critical values for the statistic $\sum \frac{(F - f)^2}{f}$. The n of 


1 See ibid., p. 140. 
the first column is equal to the number of class intervals minus 3.¹ 
The figures in the second column represent values of $\sum \frac{(F - f)^2}{f}$ 
for which there is a probability of 0.05 that an equal or greater 
value would be obtained by mere chance. For example, in the 
present instance, n = 11 - 3 = 8, and Table 22 shows that, if 
the data were truly normal, sample values for $\sum \frac{(F - f)^2}{f}$ that 
were equal to or greater than 15.51 would be obtained only 
5 times out of 100 for such a value of n. Since the computed 
value of $\sum \frac{(F - f)^2}{f} = 3.867$, the chances of an equal or greater 


TABLE 22. CRITICAL VALUES FOR $\sum \frac{(F - f)^2}{f}$* 


         Values of $\sum \frac{(F - f)^2}{f}$ for 
         Which the Probability of an 
         Equal or Greater Value Is 
  n      Just 0.05 
  1       3.84 
  2       5.99 
  3       7.81 
  4       9.49 
  5      11.07 
  6      12.59 
  7      14.07 
  8      15.51 
  9      16.92 
 10      18.31 
 11      19.67 
 12      21.03 
 13      22.36 
 14      23.68 
 15      25.00 
 16      26.30 
 17      27.59 
 18      28.87 
 19      30.14 
 20      31.41 


* Abridged from Table III, Table of $\chi^2$, in R. A. Fisher, Statistical Methods for Research 
Workers, Oliver & Boyd, Ltd., Edinburgh, by the kind permission of the publishers and 
author. 


1 See Smith and Duncan, Sampling Statistics, pp. 327-328, for an explanation 
of the significance of n in this case. 

value is much more than 0.05. Hence the curve is deemed a 
good fit, and the distribution of heights may be said to be normal. 
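
The decision rule just applied amounts to a table lookup. The sketch below is illustrative; the dictionary simply transcribes the 0.05 column of Table 22, while the function name `fit_is_acceptable` and its `constraints` parameter are hypothetical.

```python
# Critical values of the statistic at the 0.05 level, keyed by n (Table 22).
CRITICAL_05 = {
    1: 3.84, 2: 5.99, 3: 7.81, 4: 9.49, 5: 11.07,
    6: 12.59, 7: 14.07, 8: 15.51, 9: 16.92, 10: 18.31,
    11: 19.67, 12: 21.03, 13: 22.36, 14: 23.68, 15: 25.00,
    16: 26.30, 17: 27.59, 18: 28.87, 19: 30.14, 20: 31.41,
}


def fit_is_acceptable(chi_square_value, n_intervals, constraints=3):
    """True when the statistic falls below the 0.05 critical value.

    n is taken as the number of class intervals minus `constraints`,
    which is 3 when a normal curve is fitted as described in the text.
    """
    n = n_intervals - constraints
    return chi_square_value < CRITICAL_05[n]


print(fit_is_acceptable(3.867, n_intervals=11))
```

For the heights, 3.867 lies well below the critical value 15.51 for n = 8, so the function reports an acceptable fit, in agreement with the conclusion reached above.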

Comparison of Special Statistics. Although the test just out- 
lined is very commonly used, it has certain weaknesses as a test 
of normality. (1) It should be noted that the squaring of the 
differences between the group frequencies removes any signifi- 
cance that might be attributed to the signs of the differences. 
For example, it might happen in a given case that all the histo- 
gram frequencies to the left of the center were larger than the 
normal curve frequencies and that all the histogram frequencies 
to the right of the center were less than the normal curve fre- 
quencies, indicating a well-marked positive skewness; neverthe- 
less, if the absolute values of these differences were all small, the 
$\chi^2$ test might not indicate any departure from normality. (2) 
The necessity of combining the extreme intervals into larger 
groups causes a loss of information and reduces the number of 
points of comparison.¹ For these reasons, other methods of testing 
for normality have been proposed. 

If a set of sample data actually has come from a normal popula- 
tion, it is to be expected that its skewness will be slight and its 
kurtosis close to the normal kurtosis of 3. It would also be 
expected that the ratio of its average deviation to its standard 
deviation would be somewhere in the neighborhood of the value of 
this ratio for the normal curve (i.e., 0.7979). The departure of 
the actual values of these sample statistics from the theoretical 
values for the normal curve can thus be used as a test for normality. 
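
These special statistics are easily computed from ungrouped data. The sketch below is not from the original text. It uses a moment-based skewness (third moment divided by the 3/2 power of the second), which is an assumption about the exact skewness measure intended, together with the kurtosis and the ratio of average deviation to standard deviation. The sample is artificial, drawn from a normal population by the standard library's random generator, and no Sheppard correction is involved because the data are not grouped.

```python
import random
from math import pi, sqrt


def normality_statistics(values):
    """Return (skewness, kurtosis, a) for a list of observations.

    a is the ratio of the average deviation to the standard deviation.
    """
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    average_deviation = sum(abs(v - mean) for v in values) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2, average_deviation / sqrt(m2)


# Theoretical value of the ratio for the normal curve: sqrt(2 / pi).
normal_ratio = sqrt(2 / pi)

rng = random.Random(0)
sample = [rng.gauss(70.47, 2.47) for _ in range(20_000)]  # artificial data
skewness, kurtosis, a = normality_statistics(sample)

print(f"normal ratio {normal_ratio:.4f}")
print(f"skewness {skewness:+.3f}, kurtosis {kurtosis:.3f}, a {a:.4f}")
```

For a sample that really comes from a normal population the three figures fall near 0, 3, and 0.7979; how far they may stray by chance for a given sample size is what tables of critical values are for.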

For the 300 Princeton freshmen, $\beta_1$, $\beta_2$, and the ratio of average 
deviation to standard deviation (indicated by the symbol $a$) had 
the values² $\beta_1 = 0.023$, $\beta_2 = 3.021$, and $a = 0.805$. These are 
all very close to the values 0, 3, and 0.7979 of a truly normal 
distribution. Hence this last, as well as the other tests, suggests that 
heights are normally distributed. 


1 Its practical effect is to reduce the value of n to be used in the $\chi^2$ table. 

2 No account was taken of Sheppard’s correction in computing these 
values. The average deviation used in making this test was computed from 
the mean by the formula 

$$\text{A.D.} = \frac{i}{N}\left[\sum F\,|d'| + (N_b - N_a)\,c + N_0\left(\tfrac{1}{4} + c^2\right)\right]$$ 

where $d'$ is the deviation of a class mid-point from the arbitrary origin $A$ 
in class-interval units, $c = (\bar{X} - A)/i$, $N_b$ = number of cases in intervals 
below the arbitrary origin, $N_a$ = number of cases in intervals above the 
arbitrary origin, and $N_0$ = number of cases in interval containing arbitrary 
origin. ($A$ must be in the same interval as $\bar{X}$.) Cf. GEARY, R. C. and 
E. S. PEARSON, Tests of Normality, p. 4. 

Sometimes the sample values of $\beta_1$, $\beta_2$, and $a$ are not so close to 
the normal values as in the foregoing illustration. In such 
instances, use may be made of tables published in Tests of Normality 
by R. C. Geary and E. S. Pearson.¹ These tables give, for 
various-sized samples, the sample values of $\beta_1$, $\beta_2$, and $a$ for 
which the probability of a greater value is 0.05 and 0.01, respectively. 
For $\beta_2$ and $a$ they also give values of these statistics for 
which the probability of a smaller value is 0.05 and 0.01, respectively. 
If, in any given instance, the sample value of $\beta_1$, $\beta_2$, or $a$ 
falls outside the limits given for a probability of 0.05, say, then it 
may be concluded that the population from which the sample was 
drawn was not strictly normal. For the weights of the 300 
Princeton freshmen, for example, the sample values of $\beta_1$ and $\beta_2$ 
were 0.378 and 4.606. Both these are beyond the 0.01 probability 
point given by Geary and Pearson’s tables for a sample of 300 
(these were 0.329 and 3.79, respectively), and it may therefore be 
concluded that the distribution of weights is definitely not normal. 


1 Issued by the Biometrika Office, University College, London, and 
CHAPTER XII 


USE OF THE NORMAL FREQUENCY CURVE IN SAMPLING 
ANALYSIS 


The normal frequency curve has its greatest usefulness in the 
theory of random sampling. While the full exposition of the 
theory of random sampling is beyond the scope of this book, some 
of the simpler aspects that relate to the use of the normal curve 
in sampling analysis are presented in the ensuing pages of this 
chapter. 


SAMPLING FROM A TWOFOLD POPULATION 


The Problem. An elementary problem in the theory of sam- 
pling is concerned with sampling from a twofold population. 
Consider the following problem: Suppose a large city is undergoing 
a fiercely contested election. The Radicals on the one hand and 
the Conservatives on the other are contending hotly for the 
mayoralty, and everyone in the city takes a stand on one side or 
the other. The voting population of the city thus forms a group 
in which a certain percentage are Radicals and a complementary 
percentage are Conservatives. Prior to the election these per- 
centages will not be known. They may, however, be estimated 
by taking a random sample. The inferences that may be made 
from such a random sample constitute the statistical problem 
that will now be analyzed. 

Sampling Distribution. For the sake of argument suppose that 
some omniscient being knew how each individual in the city stood 
politically. Suppose that he noted their positions on slips of 
paper—one for each individual—and put the slips into a large 
urn. Suppose, further, that there are actually an equal number 
of Radicals and Conservatives. Let the omniscient being mix the 
slips of paper thoroughly and then draw out a sample of 100 slips.² 


1 For more elaborate exposition than is contained in this chapter, see 
Smith and Duncan, Sampling Statistics, Parts II and III. 
2 Mundane methods of obtaining random samples are discussed in ibid. 
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Let him note the division of opinion for this sample, put the slips 
back, and thoroughly mix them again. Finally, let him repeat 
this proeess many times, taking a sample of 100 eaeh time, so that 
he eventually aeeumulates a large number of sample percentage 
divisions of opinion. Many, but by no means all, of these sample 
percentages will be the actual population percentage of 50; the 
others will be distributed above and below the 50 per cent level. 
This will be the sampling distribution of the sample percentages. 
It is one of the important conclusions of the probability theory, 
based upon the analysis of the preceding chapters,* that the 
outcome of this process of random sampling will be a set of sam- 
ples in which the relative frequency of samples in which the 
division of opinion is 0 per cent Radical, 10 per cent Radical, 20 
per cent Radical, 30 per cent Radical, . . . , 100 per cent Radical 
will be approximately the same as the probabilities of a binomial
distribution in which $\mathbf{p}_1 = 0.50$ and $\mathbf{p}_2 = 0.50$ and N = 100.† In
other words, relative frequencies of the sample percentages may 
be estimated a priori by means of the probability calculus. 
Furthermore, since the size of the sample is large (N = 100), the 
calculation of the probabilities can be simplified by using the 
normal curve as an approximation to the binomial distribution.‡
In this problem, the curve will have a mean of 50 per cent, 
because the population is equally divided between Radicals and 
Conservatives by hypothesis, and a standard deviation equal to
5 per cent.§ The normal curve, with a mean of 50 per cent
and a standard deviation of 5 per cent, thus gives approximately
the “sampling distribution” for sample percentages taken from
a population in which the division of opinion is exactly 50 per 
cent; and this is the sampling distribution of sample percentages 
conceived in the preceding paragraph. 
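
The urn experiment described above can be imitated on a computer. The following is only a minimal sketch under stated assumptions: the function names, the number of repetitions (20,000), and the random seed are illustrative choices and are not taken from the text. Each slip is drawn with replacement from a population assumed to be exactly 50 per cent Radical.

```python
import math
import random


def simulate_sample_percentages(p_radical=0.5, sample_size=100,
                                n_samples=20000, seed=1):
    """Draw many random samples (with replacement) and return the
    percentage of Radicals found in each sample."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    percentages = []
    for _ in range(n_samples):
        radicals = sum(1 for _ in range(sample_size) if rng.random() < p_radical)
        percentages.append(100.0 * radicals / sample_size)
    return percentages


def mean_and_sd(values):
    """Mean and standard deviation of a list of numbers."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, math.sqrt(variance)


percentages = simulate_sample_percentages()
observed_mean, observed_sd = mean_and_sd(percentages)

# Standard deviation given by the formula of the text, in per cent.
theoretical_sd = 100.0 * math.sqrt(0.5 * 0.5 / 100)

# Share of the simulated samples lying between 40 and 60 per cent.
share_within_limits = sum(40 <= p <= 60 for p in percentages) / len(percentages)

print(f"mean of sample percentages: {observed_mean:.2f}")
print(f"standard deviation:         {observed_sd:.2f} (formula: {theoretical_sd:.2f})")
print(f"share between 40 and 60:    {share_within_limits:.3f}")
```

With a large number of repetitions the mean and standard deviation of the simulated percentages should lie close to 50 and 5 per cent, which is what the normal approximation leads one to expect.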
The foregoing result is not limited, however, to cases in which 
the actual division of opinion in the entire population is exactly 


* See also ibid.

† When the symbol for a sample statistic is in boldface type, it refers to
the corresponding population parameter; thus here $\mathbf{p}_1$ and $\mathbf{p}_2$ refer to the
population values for which $p_1$ and $p_2$ are corresponding sample statistics.

‡ See pp. 283-290.

§ When the variable is expressed as a percentage instead of as an absolute
deviation from an integral mean value, the formula for the standard deviation
is $\sigma_{\text{per cent}} = \sqrt{(0.5)(0.5)/N}$. Cf. p. 283.

fifty-fifty but may be shown to be valid for any division of opinion
in the population. Thus if the percentage of Radicals in the
population is $\mathbf{p}_1$ and the percentage of Conservatives is $\mathbf{p}_2$ (where
$\mathbf{p}_1 + \mathbf{p}_2 = 1$) and samples of size N are drawn at random from this
population, with replacements as above, then the relative frequencies
of various sample percentages of Radical opinion will be
given approximately by the probabilities of a normal frequency
curve whose mean is $\mathbf{p}_1$ and whose standard deviation is
$\sqrt{\mathbf{p}_1\mathbf{p}_2/N}$.

This conclusion is of capital importance in making inferences 
about a population from which a single random sample has been 
drawn, as will now be demonstrated. 

Statistical Inferences from Samples. Types of Inference. In a
real instance, no omniscient being is available to record every- 
one’s opinion. Prior to the actual election, the only practical 
way of determining the division of opinion is to take a random 
sample from the population. This may be done by stopping people
on the street, ringing doorbells, sending out letters, or the
like. When the results of the sample poll are counted, they may
be used to draw inferences about the true division of opinion in the 
population in three ways—that is to say, three types of inference 
may be drawn. (1) A certain hypothesis regarding the true 
division of opinion may be tested as to its reasonableness in the 
light of the sample results and either rejected or accepted. 
(2) So-called “confidence limits” may be set up for which it may 
be said that there is a given probability that these limits include 
the true value. (3) A best single estimate may be made of the 
population percentage; this is called an “optimum estimate.”
Each of these three types of inference will now be studied. 

Testing a Hypothesis as to the Population Percentage. Let the
hypothesis be set up that the population is evenly divided
between Radicals and Conservatives. Suppose the sample poll of
100 voters shows 57 Radicals and 43 Conservatives. Although
the sample shows a percentage in favor of the Radicals, it is 
possible, of course, that it may be misleading. Almost any result 
might be yielded by a single sample, whatever the population. 
If the population consisted even of 999,900 Conservatives and 100 
Radicals, it would still be possible for a random sample of 100 to 

1 For proof of this, see Smith and Duncan, Sampling Statistics, pp.
136-190.

consist of all Radicals. Such a result would not be very probable,
however, and the reasonableness of any hypothesis must be
judged by the probability of the sample result on the assumption 
that the hypothesis is valid. 

The general procedure for testing the hypothesis is as follows:
First, the risk that is to be allowed in rejecting a given hypothesis
when it is in fact true must be decided upon. The “coefficient of
risk,” as it is called, is commonly, but not necessarily, set at 0.05.
In other words, it is the common practice to run the risk of rejecting
a hypothesis 5 times out of 100 when it is in fact true. When a
sampling distribution is normal, this is often done by saying that a
given hypothesis will be rejected if the sample result falls beyond
$\pm 2\sigma$ from the mean value given by the hypothesis.1 In the
present instance, the hypothesis that the true division of opinion
is fifty-fifty suggests that random samples of 100 taken from such
a population will have a mean percentage of Radical votes equal to
$\mathbf{p}_1 = 50$ per cent and a standard deviation of sample percentages
equal to $\sqrt{\mathbf{p}_1\mathbf{p}_2/N} = \sqrt{(0.5)(0.5)/100} = 5$ per cent.
Accordingly, 95 per cent of the sample percentages would fall
between 50 per cent $\pm\ 2 \times 5$ per cent, or between 40 and 60 per
cent; 5 per cent of sample percentages would fall below 40 or
above 60 per cent. Hence, if this hypothesis is rejected when a
single sample return yields a percentage of Radical vote below
40 or above 60 per cent, then the hypothesis would in many
sample polls be rejected only 5 per cent of the time when it was
actually true. In other words, the rejector would be wrong only
1 out of 20 times in a large number of tries.

FIG. 102.—Sampling distribution of sample percentages of 100 votes.
(Horizontal axis: per cent, with 40, 50, 57, 60, and 62 marked. Area of
rejection below 40 per cent; area of acceptance between 40 and 60 per
cent; area of rejection above 60 per cent. Each limit lies $2\sigma$ from
the mean.)

1 The desirability in some cases of using regions of rejection that fall all
above or all below the mean is discussed in ibid., pp. 196-201.

For the given problem, suppose the coefficient of risk is put at 
0.05. Since the sample return is 57 Radical votes out of a total 
of 100, the hypothesis of an equal division of opinion is not to be 
rejected, for the sample result does not fall in the region of rejec- 
tion below 40 or above 60 per cent. In this instance, the sample 
result does not deviate sufficiently from the hypothetical per- 
centage to cause its rejection. If the sample return had been 
62 Radicals and 38 Conservatives, however, the hypothesis of
an equal division of opinion would have been rejected and it would 
have been concluded that the Radicals were in the majority. 
This argument and these conclusions are illustrated graphically 
in Fig. 102. 
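
The test just described can be written out as a short calculation. This is a sketch only: the function name and its parameters are hypothetical, and the multiplier of 2 standard deviations is the rounded figure used in the text for a coefficient of risk of 0.05.

```python
import math


def check_percentage_hypothesis(radical_votes, sample_size,
                                hypothetical_share=0.5, k=2.0):
    """Apply the plus-or-minus k sigma rule to a sample poll.

    Returns (lower_limit, upper_limit, rejected), the limits being
    expressed as proportions of the sample.
    """
    # Standard deviation of sample proportions if the hypothesis is true.
    sigma = math.sqrt(hypothetical_share * (1.0 - hypothetical_share) / sample_size)
    lower_limit = hypothetical_share - k * sigma
    upper_limit = hypothetical_share + k * sigma
    sample_share = radical_votes / sample_size
    rejected = sample_share < lower_limit or sample_share > upper_limit
    return lower_limit, upper_limit, rejected


for votes in (57, 62):
    low, high, rejected = check_percentage_hypothesis(votes, 100)
    verdict = "rejected" if rejected else "not rejected"
    print(f"{votes} Radical votes: limits {low:.2f} to {high:.2f}, hypothesis {verdict}")
```

For the two sample returns discussed above, the sketch leaves the hypothesis standing at 57 votes and rejects it at 62 votes.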

From the figure it is seen that with a sample result of 57 per
cent the hypothesis that $\mathbf{p}_1 = 0.5$ is accepted while with a sample
result of 62 per cent the hypothesis that $\mathbf{p}_1 = 0.5$ is rejected.

Determining Confidence Limits for Population Percentage. 
Before confidence limits can be established for a population 
percentage it is first necessary to decide upon the degree of con- 
fidence that is to be placed in the computed limits. This is 
usually determined by so choosing the limits that the probability 
of their including the true percentage equals an agreed-upon 
figure, called the “confidence coefficient.” For example, if the 
confidence coefficient is set at 0.95, as is the common practice, 
then the limits will be so chosen that the probability of their 
embracing the true value is just 0.95. 

In the case of a normal sampling distribution, confidence 
limits with a confidence coefficient of 0.95 may be set up as fol- 
lows: Choose as the upper confidence limit a value for the popula- 
tion percentage that, if it were the true value, would make the 
probability of getting the given sample value or a lower sample 
value just equal to 0.025. Since the sampling distribution is 
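normal, this upper limit may be obtained by choosing $\mathbf{p}_1$ so that
the sample value of 57 per cent falls at $-2\sigma$ from the mean value
of the sample percentage, i.e., at $-2\sigma$ from $\mathbf{p}_1$. The mathematical
equation becomes

$$0.57 - \mathbf{p}_1 = -2\sqrt{\frac{\mathbf{p}_1\mathbf{p}_2}{N}}$$

or, since $\mathbf{p}_2 = 1 - \mathbf{p}_1$,

$$0.57 - \mathbf{p}_1 = -2\sqrt{\frac{\mathbf{p}_1(1 - \mathbf{p}_1)}{N}}$$

When solved for $\mathbf{p}_1$, this becomes

$$\mathbf{p}_1 = \frac{0.57 + \dfrac{2}{N} + 2\sqrt{\dfrac{(0.57)(0.43)}{N} + \dfrac{1}{N^2}}}{1 + \dfrac{4}{N}}$$
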

When N is large, as it must be if the normal distribution is to be 
used as an approximation to the binomial distribution, the terms 
2/N, 4/N, and 1/N? can be dropped from the above equation 
without materially affecting the result. In this approximate 
form it becomes 


$$\mathbf{p}_1 = 0.57 + 2\sqrt{\frac{(0.57)(0.43)}{100}} = 0.67$$


In effect, this indicates that the upper confidence limit can be
found approximately by adding to the sample percentage twice
the standard deviation of the sampling distribution, computed
with the sample percentage in place of the hypothetical population
percentage. In general, if $p_1$ is taken as the sample percentage
(note that sample statistics are printed in text type and
the corresponding population parameters in boldface type), the
upper confidence limit of the population percentage is given by


$$\mathbf{p}_1 = p_1 + 2\sqrt{\frac{p_1(1 - p_1)}{N}} \tag{1}$$


This is shown graphically in Fig. 103a. 
In a similar manner, the lower confidence limit is given approxi- 
mately by the formula 


$$\mathbf{p}_1 = p_1 - 2\sqrt{\frac{p_1(1 - p_1)}{N}} \tag{2}$$

For the given instance, in which $p_1 = 0.57$, this lower limit is

$$\mathbf{p}_1 = 0.57 - 2\sqrt{\frac{(0.57)(0.43)}{100}} = 0.47$$

This is shown graphically in Fig. 103b. 
How the upper limit is determined, how the lower limit is 
determined, and the resulting range or total interval between the 
confidence limits are pictured graphically in Figs. 103a, b, and c. 


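FIG. 103.—Diagram showing how the limits of the confidence interval are
determined. (Panel (a): the upper limit, 0.67; panel (b): the lower limit,
0.47; panel (c): the interval from 0.47 to 0.67, extending $2\sigma$ on either
side of the sample value 0.57.)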


The limits of the range are 0.47 to 0.67. This is known technically
as the “confidence interval” and is shown in Fig. 103c.
Owing to the manner in which the confidence limits were derived,
it may be said that there is a probability of 0.95 that this confidence
interval includes the true population percentage. By this
is meant that, if confidence intervals were set up like this from 
many samples, 95 per cent of them would include the true 
population percentage. 
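
To see how little is lost by dropping the terms in 2/N, 4/N, and 1/N², the confidence limits may be computed both ways. The sketch below is illustrative only: the function names are hypothetical, and the "exact" limits are simply the two roots of the quadratic equation obtained by squaring the condition that the sample value lie k standard deviations from the population value (k = 2 in the text).

```python
import math


def exact_limits(sample_share, n, k=2.0):
    """Both roots of (sample_share - p)**2 = k**2 * p * (1 - p) / n."""
    centre = sample_share + k * k / (2.0 * n)
    half_width = k * math.sqrt(sample_share * (1.0 - sample_share) / n
                               + k * k / (4.0 * n * n))
    denominator = 1.0 + k * k / n
    return (centre - half_width) / denominator, (centre + half_width) / denominator


def approximate_limits(sample_share, n, k=2.0):
    """Limits of Eqs. (1) and (2): the sample value plus or minus k
    standard deviations computed from the sample percentage itself."""
    sigma = math.sqrt(sample_share * (1.0 - sample_share) / n)
    return sample_share - k * sigma, sample_share + k * sigma


exact_lower, exact_upper = exact_limits(0.57, 100)
approx_lower, approx_upper = approximate_limits(0.57, 100)
print(f"exact limits:       {exact_lower:.3f} to {exact_upper:.3f}")
print(f"approximate limits: {approx_lower:.3f} to {approx_upper:.3f}")
```

For a sample of 100 the two pairs of limits differ only in the third decimal place, and both round to the interval from 0.47 to about 0.67 given in the text.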

An Optimum Estimate of the Population Percentage. Up to
this point in the argument, a particular hypothesis regarding the
population has been tested and a method of setting up confidence
intervals has been devised. A final problem of statistical
inference is to indicate a method of making a single best estimate
of the population percentage from the given sample. Various
methods are employed, but the one that has received considerable
prominence in recent years and that will be employed here
is the method of maximum likelihood.

FIG. 104.—Diagram showing relationship between probability of sample and
likelihood of population percentage. (Panels (a) to (c): the probability of
the given sample, which equals the antilog of the likelihood, for hypothetical
population percentages of 0.62, 0.57, and 0.45; panel (d): these results
summarized.)

When a population percentage is given, the probabilities of
various sample results may be determined from the sampling
distribution of sample percentages, in this case, approximately
from the normal frequency curve. The analysis here runs from 
a given population percentage to probabilities of various sample 
results. When a particular sample result is given, however, it 
is possible to determine the probabilities of obtaining this sample 
result from various hypothetical values for the population per- 
centage. Here the analysis runs from a given sample percentage 
to the probabilities of obtaining the particular sample from 
various hypothetical population percentages. In the latter 
analysis, the logarithm of the probability of the given sample 
result for a particular value of the population percentage is 
called the “likelihood” of the population percentage. 

As shown in Figs. 104a to 104c, these likelihoods will vary for
different hypothetical values of the population percentage. The
value of the population percentage that has the maximum likelihood
is considered the best, or optimum, estimate of the population
percentage; this is shown in Fig. 104d. Figures 104a to
104c show graphically how the likelihoods of various population
percentages (or, more exactly, their antilogs) vary with changes
in the hypothetical values for these percentages. These various
results are summarized in Fig. 104d, which, if completed for a
large number of hypothetical values of the population percentage,
would become a smooth curve showing the variation
in the antilogs of the likelihoods of $\mathbf{p}_1$ with changes in $\mathbf{p}_1$. It is
to be noted that the maximum point of this curve is also the
point of maximum likelihood, since a logarithm is a maximum
when its antilog is a maximum.

Without undertaking the mathematical analysis involved,1
it may be pointed out that the value of $\mathbf{p}_1$ which has the maximum
likelihood is the value for which $\mathbf{p}_1 = p_1$. In other words,
the maximum likelihood estimate of a population percentage is the 
percentage yielded by a given sample. This then becomes the 
best estimate of the population figure; that is to say, the sample 
percentage is the optimum, or best, estimate of the population 


percentage. 
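
The result can be checked numerically by computing the likelihood for a range of hypothetical population percentages and picking out the largest. The sketch below makes two assumptions that are not in the text: it uses the exact binomial probability of the sample rather than the normal approximation, and it examines only the hypothetical values 0.01, 0.02, ..., 0.99. The function name is hypothetical.

```python
import math


def log_probability(radicals, n, p):
    """Logarithm of the binomial probability of observing `radicals`
    Radical votes in a sample of n when the population share is p.
    This logarithm is the "likelihood" of p in the sense of the text."""
    log_coefficient = (math.lgamma(n + 1) - math.lgamma(radicals + 1)
                       - math.lgamma(n - radicals + 1))
    return log_coefficient + radicals * math.log(p) + (n - radicals) * math.log(1.0 - p)


# Hypothetical population shares to be compared.
candidates = [i / 100 for i in range(1, 100)]
likelihoods = {p: log_probability(57, 100, p) for p in candidates}
best_estimate = max(likelihoods, key=likelihoods.get)

for p in (0.45, 0.57, 0.62):
    print(f"p = {p:.2f}: probability of the sample = {math.exp(likelihoods[p]):.4f}")
print(f"value with maximum likelihood: {best_estimate:.2f}")
```

Among the values examined, the likelihood is greatest at the sample percentage itself, in agreement with the statement above.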

1 For such analysis, see ibid., pp. 208-209.


SAMPLING OF MEANS AND VARIANCES


Sampling Distribution of Means and Variances. The Mean.
Most of the preceding analysis applying to sample percentages
applies equally well to means of samples from a continuously
distributed population. If the population is normal in form, it 
can readily be demonstrated that means of samples from such a 
population will form a frequency distribution which is also normal 
in form, the mean of which is the mean of the population and the 
variance of which is the variance of the population divided by the 
size of the sample. 

If the population is not normal, the sampling distribution of 
sample means nevertheless tends to be normal, with a mean 
equal to the mean of the population and a variance equal to the 
variance of the population divided by the size of the sample.1

Accordingly, the equation for the standard deviation of the
sampling distribution of sample means is as follows:

$$\boldsymbol{\sigma}_{\bar{X}} = \frac{\boldsymbol{\sigma}}{\sqrt{N}} \tag{3}$$

This is conventionally called the “standard error” of the mean.2
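
The statement that means of samples from a non-normal population still have a standard deviation close to the population standard deviation divided by the square root of N can be illustrated by simulation. The sketch below is an assumption-laden example, not part of the original argument: the population is taken to be exponential with mean 1 and standard deviation 1, the sample size is 50, and the number of samples and the seed are arbitrary.

```python
import math
import random
import statistics


def simulate_sample_means(sample_size=50, n_samples=20000, seed=7):
    """Means of repeated random samples from a skewed (exponential)
    population whose mean and standard deviation both equal 1."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_samples):
        total = sum(rng.expovariate(1.0) for _ in range(sample_size))
        means.append(total / sample_size)
    return means


sample_means = simulate_sample_means()
observed_mean = statistics.fmean(sample_means)
observed_standard_error = statistics.pstdev(sample_means)

# Standard error of the mean: population standard deviation over root N.
theoretical_standard_error = 1.0 / math.sqrt(50)

print(f"mean of sample means:    {observed_mean:.3f}")
print(f"observed standard error: {observed_standard_error:.4f}")
print(f"sigma / sqrt(N):         {theoretical_standard_error:.4f}")
```

The observed standard deviation of the simulated means should agree closely with the value given by the formula, even though the population sampled is far from normal.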

The Variance. If samples are taken from a normal population, 
the sampling distribution of sample variances is not normal for 
small samples but approaches the normal form as the samples 
become larger, say larger than 30 cases. The mean of this 
normal distribution is the variance of the population, and the 
standard deviation of the sampling distribution is the variance
of the population multiplied by $\sqrt{2/N}$.

It is to be noted that, if the population is not normal, the 
sampling distribution of sample variances may not become 
normal, even for relatively large samples. Hence the use of 
the normal curve for making inferences about a population 
variance when the population is not normal may be an unwise 
procedure, even when the sample is large. 

But for variances of large samples taken from normal populations,
the standard error of the variance is given by

$$\boldsymbol{\sigma}_{\sigma^2} = \boldsymbol{\sigma}^2\sqrt{\frac{2}{N}} \tag{4}$$


1 Ibid., p. 164.

2 Standard errors are printed in boldface type because they represent the
standard deviations of the populations of all possible sample statistics of
the type in question. Thus $\boldsymbol{\sigma}_{\bar{X}}$ is the standard deviation of all possible
sample $\bar{X}$'s.

The Standard Deviation. For standard deviations of large
samples taken from normal populations, the standard error of
the standard deviation is given by

$$\boldsymbol{\sigma}_{\sigma} = \frac{\boldsymbol{\sigma}}{\sqrt{2N}} \tag{5}$$


Inferences about Population Means and Variances. Since 
the sampling distribution of sample means tends to be normal 
in form and the same is true of the sampling distribution of 
variances and standard deviations, if the population is normal, 
it follows that the normal curve can be used to make inferences 
about the population values of these parameters from correspond- 
ing sample statistics. 

Testing a Hypothesis about the Population Mean. To illustrate 
how a hypothesis about a population mean may be tested, con- 
sider the following example. Suppose it is claimed that the 
mean length of life of a certain make of shoe (with constant wear) 
is 11.5 months. A random sample of 100 shoes is tested, and
it is found that the average length of life of this sample is 10.8 
months. The standard deviation of the sample is 1.2 months. 
Do these sample results warrant the rejection of the claim of a 
true mean value of 11.5 months? 

To answer this question, proceed as follows: Let the risk of 
rejecting a hypothesis when it is true be set at 0.05. Then cal- 
culate the standard deviation of the sampling distribution of the 
mean (the “standard error” of the mean, as it is called) from 
Eq. (3). Since the standard deviation of the population is not
known in this instance, the standard deviation of the sample
must be used in its stead.1

The value of $\boldsymbol{\sigma}_{\bar{X}}$ for the given problem is accordingly

$$\boldsymbol{\sigma}_{\bar{X}} = \frac{1.2}{\sqrt{100}} = 0.12 \text{ month}$$

Next, calculate the difference between the hypothetical value
of the mean and the sample value of the mean. This is


$$11.5 - 10.8 = 0.7 \text{ month}$$


Finally, compare this difference with the standard error of the
mean. If the difference is more than twice the standard error,
the hypothesis will not be accepted. In the present instance 0.7
is more than five times as great as 0.12; so the claim that the true
mean is 11.5 is rejected. The sample mean deviates too greatly
from the hypothetical mean for the latter to be accepted as
reasonable.

1 This substitution does not materially affect the analysis when the
sample is large. For further discussion, see ibid., pp. 273-284.

Confidence Limits for the Population Mean. Confidence limits 
for the true mean with a confidence coefficient of 0.95 will be 
obtained by laying off $2\boldsymbol{\sigma}_{\bar{X}}$ plus and minus from the sample
value. Thus, in the present problem these limits will be
$10.8 \pm 2(0.12) = 11.04$ and $10.56$. Accordingly it can be said
that there is a probability of 0.95 that the interval from 10.56 to 
11.04 includes the true population mean within its range. 
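
The test of the claimed mean and the confidence limits just obtained can be reproduced in a few lines. This is a sketch under the assumptions of the text: the sample standard deviation stands in for the unknown population value, and the multiplier 2 is used for a coefficient of 0.05 or 0.95. The function and variable names are hypothetical.

```python
import math


def standard_error_of_mean(standard_deviation, sample_size):
    """Standard deviation of the sampling distribution of the mean."""
    return standard_deviation / math.sqrt(sample_size)


sample_mean = 10.8         # months
hypothetical_mean = 11.5   # months, the claim being tested
sample_sd = 1.2            # months, used in place of the population value
sample_size = 100

standard_error = standard_error_of_mean(sample_sd, sample_size)
difference = abs(hypothetical_mean - sample_mean)
ratio = difference / standard_error
claim_rejected = ratio > 2.0

lower_limit = sample_mean - 2.0 * standard_error
upper_limit = sample_mean + 2.0 * standard_error

print(f"standard error of the mean: {standard_error:.2f} month")
print(f"difference / standard error: {ratio:.2f}")
print(f"claim rejected: {claim_rejected}")
print(f"confidence limits: {lower_limit:.2f} to {upper_limit:.2f} months")
```

The hypothetical mean of 11.5 months lies well outside the limits computed from the sample, which is another way of seeing why the claim is rejected.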

Optimum Estimate of the Population Mean. If the method of 
maximum likelihood is used to give the best estimate of the 
population mean, it is found that the sample mean is the maxi- 
mum likelihood estimate of the population mean. Hence, in 
the present instance the best estimate of the population mean 
is 10.8 months. 

Testing a Hypothesis about the Population Variance. The 
same analysis can be applied to inferences regarding population 
variances from sample variances. Suppose it is claimed that 
the true variability in the life of the given make of shoes is 
1.0 month. As in the case of the mean, this hypothesis may be 
tested by comparing the hypothetical value with the standard 
deviation of the sample of 100 shoes, which it will be assumed is 
1.2 months. 

The variances, or squares of the standard deviations, are 1.0 
and 1.44 square months, respectively. Their difference is 
1.44 — 1.00 = 0.44 square month. The standard deviation of 
the sampling distribution of sample variances, i.e., the standard 
error of the sample variance, is 


$$\boldsymbol{\sigma}^2\sqrt{\frac{2}{N}} = 1.00\sqrt{\frac{2}{100}} = 0.141$$


Since the difference between the hypothetical value and the 
sample value is more than three times (0.44/0.14 = 3+) the 
standard error of the sample variance, the hypothesis must again 
be rejected.1


1 For more exact methods especially applicable to small samples, see ibid.,
pp. 284-287.

If it were desired to test a hypothesis about the standard deviation,
rather than about the variance, Eq. (5) would be used. In
the present instance, the population standard deviation is hypothetically
set at 1.0 month and the standard deviation in the
sample is 1.2 months; the difference is 0.2 month. Using Eq. (5),
the standard error of the standard deviation in this problem is
found to be

$$\boldsymbol{\sigma}_{\sigma} = \frac{1.0}{\sqrt{200}} = 0.071$$

Since the difference is almost three times the standard error, the
hypothesis is rejected as unreasonable.
Confidence Limits for the Population Variance. Confidence 
limits for the population variance with a confidence coefficient of 
0.95 are given by 


$$\boldsymbol{\sigma}^2 = \sigma^2 \pm 2\boldsymbol{\sigma}_{\sigma^2} = 1.44 \pm 2(0.14) = 1.72 \text{ and } 1.16$$


It can thus be said that there is a probability of 0.95 that the
interval from 1.16 to 1.72 includes the true variance. The corresponding
interval for the population standard deviation is from
1.06 to 1.34, obtained by making use of Eq. (5).
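
The figures for the variance and the standard deviation can be retraced as follows. The sketch is illustrative and its names are hypothetical. One assumption deserves notice: the limits quoted in the text are reproduced when the standard errors are computed from the hypothetical standard deviation of 1.0 month, and that is what the sketch does; if the sample values (1.44 and 1.2) were used in the standard-error formulas instead, the limits would be somewhat wider.

```python
import math


def standard_error_of_variance(variance, sample_size):
    """Eq. (4): the variance multiplied by the square root of 2/N."""
    return variance * math.sqrt(2.0 / sample_size)


def standard_error_of_sd(standard_deviation, sample_size):
    """Eq. (5): the standard deviation divided by the square root of 2N."""
    return standard_deviation / math.sqrt(2.0 * sample_size)


sample_size = 100
hypothetical_sd = 1.0   # months
sample_sd = 1.2         # months

se_variance = standard_error_of_variance(hypothetical_sd ** 2, sample_size)
se_sd = standard_error_of_sd(hypothetical_sd, sample_size)

# Ratios used in testing the two hypotheses.
variance_ratio = (sample_sd ** 2 - hypothetical_sd ** 2) / se_variance
sd_ratio = (sample_sd - hypothetical_sd) / se_sd

# Confidence limits laid off two standard errors either side of the sample value.
variance_limits = (sample_sd ** 2 - 2.0 * se_variance, sample_sd ** 2 + 2.0 * se_variance)
sd_limits = (sample_sd - 2.0 * se_sd, sample_sd + 2.0 * se_sd)

print(f"standard error of variance: {se_variance:.3f}, ratio {variance_ratio:.2f}")
print(f"standard error of standard deviation: {se_sd:.3f}, ratio {sd_ratio:.2f}")
print(f"variance limits: {variance_limits[0]:.2f} to {variance_limits[1]:.2f}")
print(f"standard deviation limits: {sd_limits[0]:.2f} to {sd_limits[1]:.2f}")
```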

Optimum Estimate of the Population Variance. Finally, the
maximum likelihood estimate of the population variance is (for
large samples) approximately the variance of the sample.1
Hence the best estimate of the population variance in this instance 
is 1.44, which gives a population standard deviation of 1.2. 

CONCLUSION 

From the few illustrations in this chapter, it should be clear 
that the normal curve is very useful in making inferences about 
populations from random samples. It can be used to measure 
sampling fluctuations in sample percentages, sample means, and, 
in certain instances, sample variances, as well as in a number of
other statistics. It also has many uses in more advanced sampling
analyses and is probably the most important sampling
distribution that occurs in statistical theory.

1 For small samples the multiplier $\frac{N}{N-1}$ should be applied to the sample
variance to give a better estimate of the population variance. Thus the
optimum estimate, if N is small, say less than 30, is as follows:

$$\text{optimum estimate of } \boldsymbol{\sigma}^2 = \sigma^2\,\frac{N}{N-1}$$

Cf. ibid., pp. 290-294.

Table 23 contains not only the standard errors discussed in 
this chapter but also the standard errors for a number of other 
statistics. The method of applying these formulas to test 
hypotheses, to set up confidence intervals, or to obtain optimum 
estimates is similar for all statistics obtained from large samples. 
TABLE 23.—SAMPLING ERRORS IN ELEMENTARY STATISTICS FOR WHICH THE
SAMPLING DISTRIBUTION APPROXIMATES THE NORMAL CURVE*
(Ordinarily these formulas for standard error cannot be used for N < 30)

Statistics    Standard errors

* Cf. WAUGH, ALBERT E., Elements of Statistical Method (1938), pp. 142-147.


PART IV 
Study of Bivariates and Multivariates 


CHAPTER XIII
SIMPLE CORRELATION 


CORRELATION FUNDAMENTAL TO KNOWLEDGE 


Progressive development in the methods of science and philos- 
ophy has been characterized by increase in the knowledge of 
relationships, or correlations. Nature has been found to be a 
multiplicity of interrelated forces. The phenomena of the 
physical world outside man seem to be well adapted to this 
concept of interrelationship. The same is true with respect 
to phenomena having to do with human beings and their 
environment.

Progress in the Discovery of Correlation. In the physical 
sciences, where the laws of nature are, within certain limits, 
determinate, experimental method has sufficed to disclose innu- 
merable relationships. Many of these physical correlations have 
become definitely known as “cause and effect relationships.” To 
some degree, too, this is true of biology, anthropology, geology, 
and the like. In these fields of study, great progress was made 
possible by the use of observation of “cases,” by tracing cor- 
relations previously known or suspected, and by laboratory 
experiments. In the social sciences, however, the establishment 
of certain knowledge, or knowledge of a high degree of probability 
regarding relationships, is a more difficult problem; and little 
scientific progress, comparatively speaking, has been made 
through the speculative method. This is particularly true so far 
as cause and effect relationships are concerned.

For example, philosophical speculation, based upon qualitative
or semiquantitative observation of experience, seemed to many
economists of the eighteenth, nineteenth, and twentieth centuries
to have codified the relationship between money and credit on the
one hand and prices and many social problems on the other hand. 
But no such certainty among these social scientists now exists as 
to the nature of the cause and effect order of events. In its 
earlier conception, the principle of the quantity theory of money 
seemed to be one of extraordinary simplicity and determinate- 
ness; but the more it is studied in its quantitative aspects the 
more complicated it is found to be in reality. By the 1930’s 
and 1940’s, the world of scientific monetary theorists came to be 
characterized by confused controversy. The practical world still 
awaits their solution of the theoretical problem in order to make 
possible a world-wide solution of the problem of monetary reform. 
Some say that increases or decreases in the quantity of money 
cause rising and falling prices, respectively; but others, with con- 
vincing argument, maintain that rising prices cause an increase in 
the quantity of money, and vice versa. It is a moot question as 
to whether or not statistics can come to the rescue in the matter 
of deciding the direction of the cause and effect relationship; 
but at least the technique has been developed to disclose the 
facts of relationship more precisely than was ever before 
possible. 

By the latter half of the nineteenth century, in many fields of 
study, a point had been reached where speculation concerning 
relationships could advance no farther with the existing tech- 
niques. More exact measurement of relationship was needed. 
Many questions in biology, anthropology, and the social sciences 
generally awaited a scientific answer to the question: How can 
relationship be measured? Two interesting attempts were made
by American scholars to devise a method of measuring relationship,
one in 1877 and the other in 1892.1 Credit for the discovery
of a method, and for its subsequent mathematical development, 
however, belongs largely to the scholars of England. 

Origin and Development of the Measurement of Correlation. In 
the nineteenth century pre-Darwinian and Darwinian doctrines of 


1 BOWDITCH, H. P., “The Growth of Children,” Eighth Annual Report
of the State Board of Health of Massachusetts (1877), pp. 275-324; BRYAN,
W. L., “On the Development of Voluntary Motor Ability,” American
Journal of Psychology, Vol. 5 (1892), pp. 123-204. These are both described
in Helen M. Walker, Studies in the History of Statistical Method (1929), pp.
100-102, 109-110.

evolution were taking root, and the question of the influence of
heredity vs. environment upon human characteristics was in a 
state of rarefied speculation and controversy. The experimental 
data appeared chaotic and amenable to as many interpretations 
as there were interpreters. 

One of the great nineteenth-century students of the problem of 
heredity was Sir Francis Galton. He had been profoundly 
impressed by Darwin’s Origin of Species (1859), concerning which 
he said,1 “Its effect was to demolish a multitude of dogmatic
barriers by a single stroke, and to arouse a spirit of rebellion 
against all ancient authorities whose positive and unauthenticated 
statements were contradicted by modern science." Galton made 
numerous studies on the subject of heredity. The question that 
was motivating his studies was: How is it possible for a whole 
population to remain alike in its features, as a whole, during many 
successive generations, if the average produce of each couple 
resemble their parents? He attacked the question by studying 
sweet peas, moths, hounds, and finally the records of human 
families, which he obtained by offering prizes. 

Between the years 1877 and 1889, Galton worked out a mathe- 
matical method by which he could give an exact measure of the 
relationship between, for example, heights of children and the 
average heights of their parents. By statistical measurement he 
found that, if the stature of a group of parents is found to be, say 
y inches above or below the general average of the race, the average
stature of their children will be only ⅔ y inches above or below
the average of the race; and he induced the law that the mean
heights of offspring tend to “regress back toward the mean of the
race” in spite of the strong hereditary influence of the parents.
This is the famous law of regression to type, although the exact
figure ⅔ is not to be taken as final.

The method Galton used was based upon the median and 
quartiles and has not been generally followed in subsequent work. 
In the 1890’s another method, based on the arithmetic mean 
and the standard deviation, was devised by Karl Pearson. His 


1 “Hereditary Talent and Character,” Macmillan’s Magazine, Vol. 12
(May, 1865-October, 1865), pp. 157-166, 318-327; Hereditary Genius
(1869, 2d ed. 1892); English Men of Science (1874); Human Faculty (1883);
Record of Family Faculties (1884); Life History Album (1884); Natural
Inheritance (1889). Cf. WALKER, op. cit., pp. 102-103.

method has been widely adopted and is known as the “Pearsonian
coefficient of correlation.”1

It should be pointed out that in the fields of meteorology and 
astronomy mathematicians had previously worked out a formula 
for a joint or bivariate normal frequency distribution. This 
gave the probability of the simultaneous occurrence of two errors 
of observation but did not directly indicate a measure of correla- 
tion between them. Work in this field was more concerned with 
the simultaneous occurrence of independent errors than of 
correlated errors.2 Galton, as already indicated, was primarily
concerned with the problem of correlation, and it remained for 
Karl Pearson and others to combine the work of Galton and the 
work of the mathematicians into a unified theory of correlation. 
Pearson’s development of the theory of correlation will be 
explained on pages 338 to 349.

Applications of the Method by Social Scientists. As early as 
1901, R. H. Hooker, using the Pearsonian coefficient, studied 
correlation between marriage rates and trade. He correlated 
marriage rates with per capita exports of England, with per 
capita imports, and with other trade events. In 1906, G. 
Udny Yule likewise made a study of correlation between mar- 
riage rates and trade. He also correlated trade activity with 
birth rates and death rates but found little correlation between 
them.4


1 Cf. WALKER, op. cit., pp. 110-115; PEARSON, KARL, “Notes on the
History of Correlation,” Biometrika, Vol. 13 (1920-1921), pp. 25-45, where
he cites W. F. R. Weldon, “Variations Occurring in Certain Decapod
Crustacea—I. Crangon vulgaris,” Proceedings of the Royal Society of London,
Vol. 47 (1890), pp. 445-453; WELDON, W. F. R., “Certain Correlated Variations
in Crangon vulgaris,” Proceedings of the Royal Society of London, Vol. 51
(1892), pp. 2-21; YULE, G. U., “On the Theory of Correlation,” Journal of
the Royal Statistical Society, Vol. 60 (1897), pp. 812-850.

2 PRETORIUS, B. J., “Skew Bivariate Frequency Surface, Examined in the
Light of Numerical Illustrations,” Biometrika, Vol. 22 (1930-1931), pp.
109-223; PEARSON, KARL, “The Contribution of Giovanni Plana to the
Normal Bivariate Frequency Surface,” Biometrika, Vol. 20A (1928), pp.
295-298; WALKER, HELEN M., “The Relation of Plana and Bravais to the
Theory of Correlation,” Isis, Vol. 10, No. 34 (1938), pp. 466-484.

3 "Correlation of the Marriage-rate with Trade," Journal of the Royal 
Statistical Society, Vol. 64 (September, 1901), pp. 485-492. 

4 YULE, G. UDNY, “On Changes in the Marriage- and Birth-rates in
England and Wales, Etc.,” Journal of the Royal Statistical Society, Vol. 69

The entire science of biometrics has been built up by the
development of correlation methods; Karl Pearson is one of the
founders of Biometrika, the scientific organ in that field of study.
Correlation measurement has been intensively applied in psy-
chological and educational research. In recent years, the
correlation method has played an important role in the analysis
of economic problems and in economic theory, a trend particularly
evident in the field of agricultural economics.


THE BIVARIATE FREQUENCY DISTRIBUTION 


The statistical basis for the study of correlation is the bivariate 
or multivariate frequency distribution. In the univariate 
frequency distributions studied in the previous chapters, the 
data were classified according to a single characteristic. In 
bivariate or multivariate distributions, data are classified accord- 
ing to two or more characteristics. This chapter will be con- 
cerned with the analysis of bivariate distributions. Chapter 
XVI will deal with multivariate distributions. 

An Illustration of a Bivariate Distribution. Table 24 shows 
the distribution of grades of 81 freshmen in a second-semester 
English course at Mount Holyoke. For each of these 81 students 


TABLE 24.—GRADES OF 81 MOUNT HOLYOKE FRESHMEN IN A SECOND-
SEMESTER ENGLISH COURSE

    Grades    Frequencies
      X₁           F
      60-          1
      80-          0
     100-          3
     120-          0
     140-          2
     160-          9
     180-          8
     200-         16
     220-         17
     240-         13
     260-          9
     280-          1
     300-          2
                  ──
                  81


(1906), pp. 88-132; “The Applications of the Method of Correlation to Social
and Economic Statistics,” Journal of the Royal Statistical Society, Vol. 72
(1909), pp. 721-730.
1 RUGG, HAROLD O., Statistical Methods Applied to Education.
there is also available the grade in first-semester English. Hence
they may be cross-classified according to both their first- and
second-semester grades. This has been done in Table 25.

TABLE 25.—A BIVARIATE FREQUENCY DISTRIBUTION OF 81 MOUNT HOLYOKE
FRESHMEN ACCORDING TO THEIR GRADES IN FIRST- (X₂) AND SECOND-
(X₁) SEMESTER ENGLISH


Xt = 
60- 1 l 
80- 9 
100- | 2 1 | 3 
120- | 0 
140- 1 1 2 
160- 5 | iet Si 9 
180- Gleck 8 
200- EX etr Dec 16 
220- PX E sra e^ 17 
240- 2| | a| 1 13 
260- FS Katz 4| 4 9 
280- | 4 | 1 
300- | | KG EES a2 
F 2/ 0) 1/ 7|5|Ss|9][|13|15|1| 732 T ar 


The bivariate frequency distribution represented by Table 25 
gives more complete information than is contained in the uni- 
variate frequency distribution of Table 24. Of the 8 students 
having second-semester grades between 180 and 200, the seventh 
row of Table 25 shows that 2 had first-semester grades between 
140 and 160, 4 had first-semester grades between 160 and 180, 
and 2 had first-semester grades between 180 and 200. This is a 
small univariate frequency distribution of the group of students 
who had grades between 180 and 200 in their second-semester
course. In Table 25 there are 11 rows and 11 columns each of 
which contains a univariate frequency distribution. Since there 
are 11 subgroups of 11 groups, there are altogether 121 classes, 
represented in the table by 121 squares, or cells, of which 28 cells 
contain frequencies. 

The totals of the columns of Table 25 give the univariate
frequency distribution of all the students classified according to
their first-semester English grades. The totals of the rows give
the univariate frequency distribution of all the students classified
according to their second-semester English grades.

For each of the columns an arithmetic mean could be calculated
and the question could be answered: Did girls who earned
high grades in their first-semester English average higher grades 
in second-semester English than did the girls who attained only 
low grades in their first-semester English? An arithmetic mean 
could similarly be calculated for each of the row frequency 
distributions. For all the 11 column frequency distributions 
and all the 11 row frequency distributions the standard deviations 
also could be calculated. In other words, in this bivariate 
frequency distribution there are 22 univariate frequency dis- 
tributions in addition to the 2 univariate frequency distributions 
represented by the totals for the respective variables. Each of 
these 22 frequency distributions might be analyzed in the same 
way as any frequency distribution. 
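
The cross-classification just described can be sketched in a few lines of Python. The grades below are hypothetical (they are not the Mount Holyoke data of Table 25), and the names `pairs`, `class_start`, and `column_summary` are illustrative assumptions. The sketch tallies cell frequencies, forms the row and column totals, and computes the mean and standard deviation of one column, using class midpoints as in grouped data.

```python
from collections import Counter
from statistics import fmean, pstdev

# Hypothetical paired grades, each written as (first-semester X2, second-semester X1).
pairs = [(150, 190), (170, 190), (170, 210), (190, 190), (190, 230),
         (210, 210), (210, 230), (230, 250), (250, 250), (250, 270)]
WIDTH = 20  # class interval, as in Tables 24 and 25


def class_start(value, width=WIDTH):
    """Lower limit of the class interval that contains value."""
    return (value // width) * width


# Cell frequencies of the bivariate table: key = (row class of X1, column class of X2).
cells = Counter((class_start(x1), class_start(x2)) for x2, x1 in pairs)

# Row totals give the univariate distribution of X1; column totals that of X2.
row_totals = Counter()
col_totals = Counter()
for (row, col), freq in cells.items():
    row_totals[row] += freq
    col_totals[col] += freq


def column_summary(col):
    """Mean and standard deviation of X1 (class midpoints) within one X2 column."""
    mids = [row + WIDTH / 2
            for (row, c), freq in cells.items() if c == col
            for _ in range(freq)]
    return fmean(mids), pstdev(mids)


print(sorted(row_totals.items()))
print(sorted(col_totals.items()))
print(column_summary(160))
```

Each column (and, symmetrically, each row) is thus handled exactly like an ordinary univariate frequency distribution.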


METHODS OF SUMMARIZATION AND COMPARISON IN 
BIVARIATE DISTRIBUTIONS 


The characteristics of a bivariate frequency distribution can 
be described by various statistics. Many of these are the same 
as the statistics employed in the description of a univariate 
frequency distribution, but some are new. Thus, the central 
tendency of one of the two variables may be measured by its 
mean, its mode, or its median. The dispersion of this variable 
may be measured by its range, standard deviation, average 
deviation, or quartile deviation; and its skewness and kurtosis 
may be measured by β₁ and β₂, respectively. The same is true
for the other variable and for the numerous univariate frequency 
distributions that make up the details of a single bivariate 
distribution, as explained in the preceding paragraph. New 
statistics are required, however, to describe the tendency of the
variables to vary in unison. A bivariate frequency distribution 
thus presents the new problem of measuring correlation and the 
discovery of statistics for measuring it. 

Progressions of Means. If the data are grouped in the form
of a bivariate scatter diagram such as Table 25, one way to
measure the association between the two variables is to compute
the mean values of one variable for various values of the other
variable. In Table 25, for example, the means of the columns
would show how the X₁ variable tends to change on the average
with changes in X₂, and the means of the rows show how the X₂
variable tends to change on the average with changes in X₁.
The values of these column and row means are given in Table 26
and graphed in Figs. 105 and 106.

Fig. 105.—Progression of the means of X₁ with changes in X₂.
[Figure: column means of X₁ plotted against X₂, with the means X̄₁ = 217.4 and X̄₂ = 204.1 and the line of regression X₁′ = 47.579 + 0.8322X₂ marked.]

The nature of the association between the variables is evident
from these graphs. Consider, for example, the progression of
the means of X₁ shown in Table 26 and Fig. 105. These show
that the mean value of X₁ tends to increase with increases in X₂.
Thus, when X₂ is between 100 and 120, the mean value of X₁ is
110; when X₂ is between 200 and 220, the mean value of X₁ is
222.31; and when X₂ is between 260 and 280, the mean value of X₁
is 266.0. Although the increase in the average value of X₁ with a
given increase in X₂ does not appear to be uniform, the progres-
sion of the means of X₁ with a change in X₂ does appear to follow
a straight line. The same can be said of the progression of the
means of X₂ with changes in X₁.

Fig. 106.—Progression of the means of X₂ with changes in X₁.
[Figure: row means of X₂ plotted against X₁ (scale 60 to 300), with the means X̄₁ = 217.4 and X̄₂ = 204.1 and the line of regression X₂′ = −5.548 + 0.9642X₁ marked.]

Lines of Regression. The tendency of the progressions of
means to follow straight lines suggests the following hypothesis:
Consider first the progression of the means of X₁ with changes in
X₂. Suppose that X₁ is related to X₂ in such a way that an
increase in X₂ of one unit always produces an increase in X₁ of,
say, b units, b being a constant. If X₂ were the only factor affect-
ing X₁, all the values of X₁, when plotted, would fall exactly on a
straight line and the progression of all means would be perfectly
linear. If there were other forces affecting X₁, however, causing
it to be higher or lower than the value expected from its associa-
tion with X₂, then the actual values would not fall on a straight
line but would be scattered about that line. If this view of the
variation between X₁ and X₂ is adopted, a straight line fitted to
the data should give the law of relationship between X₁ and X₂
and the scatter about it should give the deviation from this line
caused by the other factors affecting X₁.
TABLE 26.—MEANS OF ROWS AND MEANS OF COLUMNS
From correlation table showing the relationship between second- and first-
semester grades of 81 Mount Holyoke freshmen

Left-hand pair of columns: progression of means of second-semester English
grade (X₁) with successive values of first-semester English grade (X₂).
Regression of X₁ on X₂. (Vertical frequency distributions in Table 25)

Right-hand pair of columns: progression of means of first-semester English
grade (X₂) with successive values of second-semester English grade (X₁).
Regression of X₂ on X₁. (Horizontal frequency distributions in Table 25)

    Values of X₂   Means of X₁   |   Values of X₁   Means of X₂
        60-          110.00      |       60-          130.00
        80-          ......      |       80-          ......
       100-          110.00      |      100-           83.33
       120-          152.86      |      120-          ......
       140-          178.00      |      140-          160.00
       160-          195.00      |      160-          141.11
       180-          203.33      |      180-          170.00
       200-          222.31      |      200-          200.00
       220-          241.11      |      220-          225.29
       240-          250.00      |      240-          234.62
       260-          266.00      |      260-          256.67
       280-          310.00      |      280-          230.00
                                 |      300-          290.00



A similar view could be taken of the variation in the mean
value of X₂ with changes in X₁ and would justify drawing a
straight line to show the law of relationship between X₂ and X₁.
The lines that are derived to show the relationship between the
mean value of one variable and changes in value of another are
called “lines of regression,” following Galton, who used this term
in his original study of the relationship between the heights of
children and the heights of their parents.

The Line of Regression of X₁ on X₂. Suppose the above hypoth-
esis is adopted, namely, that X₁ is linearly related to X₂ and
that deviations from this relationship are the result of forces
independent of X₂. The statistical problem then becomes how to
draw the line that is supposed to show this linear relationship.

One of the simplest ways of finding the line of regression of X₁
on X₂ is to plot the progression of the means of X₁ for various
values of X₂ and to draw a line freehand through the means so that
it seems to fit the progression of means. The great difficulty with
this method is that it involves considerable personal discretion
and that no two persons will necessarily draw the same line.

An impersonal method of fitting a line to a given set of data is
the so-called “method of least squares.” This fits a line to the
data so that the sum of the deviations of the dependent variable
from the line is zero and the sum of the squares of the deviations is
a minimum (hence the name “method of least squares”).
Mathematically, the first of these two conditions follows from
the second, so that there is really only one condition, viz., that of
least squares.

The use of the method of least squares to fit lines to a set of 
data goes back to the beginning of the nineteenth century. It 
first came into prominence in 1806, when Adrien Marie Legendre 
(1752-1833) published a book on new methods of determining 
orbits of comets. After the publication of this book, Karl 
Friedrich Gauss (1777-1855), a German mathematician, claimed 
that he had been applying this principle since 1795. 

Later it was shown that, if the method is used to fit a line to a 
sample set of data, then, under particular circumstances, the line 
so determined is the best, or optimum, estimate of the population 
line. For example, if data are available as to the orbit of a comet 
or planet and if a line or curve is fitted to these data by the method 
of least squares, then the line or curve so obtained would be the 
most probable estimate of the true orbit. 

The line of regression of X₁ on X₂ may be derived by the method
of least squares as follows: Consider the point P, Fig. 107. This,
according to hypothesis, would fall at P′ if there were no forces
associated with X₁ other than X₂. Supposedly, however, there
are other forces that are independent of X₂ and make X₁ smaller
than this average value so as to cause the point to be located
actually at P. Since these other forces affect only X₁, the point
is deflected in a vertical direction. The line of regression of X₁ on
X₂ is therefore to be obtained by minimizing the vertical devia-
tions from the line.

1 See SMITH and DUNCAN, Sampling Statistics, pp. 372-375.
Let the equation of this line of regression of X₁ on X₂ be

$$X_1' = a_{1.2} + b_{1.2}X_2$$

The deviations would then be

$$d_{1.2} = X_1 - X_1' = X_1 - a_{1.2} - b_{1.2}X_2$$

and the problem is to determine a₁.₂ and b₁.₂ so as to minimize the
sum of the squares of deviations like x₁ − x₁′ shown in Fig. 107, i.e.
(X₁ − X₁′ = x₁ − x₁′ because x₁ = X₁ − X̄₁ and x₁′ = X₁′ − X̄₁),

$$\sum (X_1 - X_1')^2 = \sum (X_1 - a_{1.2} - b_{1.2}X_2)^2 = \text{minimum}$$

Fig. 107.—Diagram illustrating the fitting of the line of regression of X₁ on X₂
by the method of least squares (vertical deviations minimized).

According to the differential calculus, the conditions for min-
imizing Σ(X₁ − a₁.₂ − b₁.₂X₂)² are that the partial derivative
with respect to a₁.₂ and the partial derivative with respect to b₁.₂
should both be zero. These conditions are

$$\frac{\partial \sum (X_1 - a_{1.2} - b_{1.2}X_2)^2}{\partial a_{1.2}} = -2\sum (X_1 - a_{1.2} - b_{1.2}X_2) = 0, \quad\text{or}\quad \sum d_{1.2} = 0 \tag{1}$$

$$\frac{\partial \sum (X_1 - a_{1.2} - b_{1.2}X_2)^2}{\partial b_{1.2}} = -2\sum (X_1 - a_{1.2} - b_{1.2}X_2)X_2 = 0, \quad\text{or}\quad \sum d_{1.2}X_2 = 0 \tag{2}$$

If the parentheses are removed, these equations may be written

$$N a_{1.2} + b_{1.2}\sum X_2 = \sum X_1$$

$$a_{1.2}\sum X_2 + b_{1.2}\sum X_2^2 = \sum X_1X_2$$

(Σa₁.₂ = Na₁.₂ because a₁.₂ is a constant.)
The first gives a₁.₂ in terms of b₁.₂ as follows:

$$a_{1.2} = \bar{X}_1 - b_{1.2}\bar{X}_2 \tag{3}$$

(ΣX₁/N = X̄₁, and ΣX₂/N = X̄₂.)
If this is substituted in the second, the value of b₁.₂ is found to be

$$b_{1.2} = \frac{\sum X_1X_2 - N\bar{X}_1\bar{X}_2}{\sum X_2^2 - N\bar{X}_2^2} \tag{4}$$


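Equations (3) and (4) thus give the values of a₁.₂ and b₁.₂ in terms
of the sample values of X₁ and X₂. If these values are grouped
into class intervals and deviations are measured from an arbitrary
origin, the last equation may be put in the form

$$b_{1.2} = \frac{\sum F\,\dfrac{d_1}{i_1}\dfrac{d_2}{i_2} - N\,\dfrac{c_1}{i_1}\dfrac{c_2}{i_2}}{\sum F\left(\dfrac{d_2}{i_2}\right)^2 - N\left(\dfrac{c_2}{i_2}\right)^2}\cdot\frac{i_1}{i_2} \tag{5}$$

where

$$\frac{c_1}{i_1} = \frac{\sum F\,(d_1/i_1)}{N}, \qquad \frac{c_2}{i_2} = \frac{\sum F\,(d_2/i_2)}{N}$$

If deviations are measured from the means of X₁ and X₂, then

$$a_{1.2} = 0, \qquad b_{1.2} = \frac{\sum x_1x_2}{\sum x_2^2} \tag{6}$$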




Fig. 108.—Diagram illustrating the fitting of the line of regression of X₂ on X₁
by the method of least squares (horizontal deviations minimized).

In the next chapter, in which the work of measuring correlation
is illustrated by numerical calculations, it is found that for the
bivariate frequency distribution of Table 25,

$$\sum F\,\frac{d_1}{i_1}\frac{d_2}{i_2} = 455, \qquad \sum F\left(\frac{d_2}{i_2}\right)^2 = 493, \qquad \frac{c_1}{i_1} = \frac{111}{81}, \qquad \frac{c_2}{i_2} = \frac{57}{81}$$

$$b_{1.2} = \frac{455 - (81)\left(\dfrac{111}{81}\right)\left(\dfrac{57}{81}\right)}{493 - 81\left(\dfrac{57}{81}\right)^2} = 0.8322$$

For these same data, X̄₁ = 217.4 and X̄₂ = 204.1, so that
a₁.₂ = 217.4 − 0.832(204.1) = 47.58. The line of regression of
X₁ on X₂ is thus X₁′ = 47.58 + 0.8322X₂.
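
The fitting procedure of Equations (3) and (4) can be sketched in Python as follows. The five observations are hypothetical and ungrouped (the individual Mount Holyoke grades are not reproduced here), and the function name `regression_line` is an assumption made for illustration. The sketch also checks numerically that the fitted line satisfies the least-squares conditions (1) and (2).

```python
def regression_line(x1, x2):
    """Return (a, b) for the line X1' = a + b*X2, using Equations (3) and (4)."""
    n = len(x1)
    mean1 = sum(x1) / n
    mean2 = sum(x2) / n
    # Equation (4): slope from raw sums of products and squares.
    b = (sum(u * v for u, v in zip(x1, x2)) - n * mean1 * mean2) / (
        sum(v * v for v in x2) - n * mean2 ** 2)
    # Equation (3): intercept from the two means.
    a = mean1 - b * mean2
    return a, b


# Hypothetical observations, not taken from Table 25.
x2 = [1.0, 2.0, 3.0, 4.0, 5.0]   # independent variable
x1 = [2.0, 4.0, 5.0, 4.0, 5.0]   # dependent variable

a, b = regression_line(x1, x2)

# Vertical deviations d = X1 - X1' from the fitted line.
d = [u - (a + b * v) for u, v in zip(x1, x2)]

print(round(a, 4), round(b, 4))
print(abs(sum(d)) < 1e-9)                               # condition (1)
print(abs(sum(di * v for di, v in zip(d, x2))) < 1e-9)  # condition (2)
```

For grouped data the same function could be applied to class midpoints repeated according to their cell frequencies, which is equivalent to weighting each product by F.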
Line of Regression of X₂ on X₁. The line of regression of X₂ on
X₁ may also be obtained by either freehand or mathematical
methods. A freehand line could be obtained by drawing a line
through the progression of the means of X₂ on X₁. A mathe-
matical line could be obtained by the method of least squares.

The preceding section determined a mathematical formula for
the line of regression of X₁ on X₂ by minimizing the sum of the
squares of the vertical deviations. Now X₂ is assumed to be the
dependent variable, and the line of regression of X₂ on X₁ is
determined by minimizing the sum of the squares of the horizontal
deviations (see Fig. 108). Except for this difference, the process
is precisely the same as that described for fitting the other line
and will not be repeated here. If the line of regression of X₂ on
X₁ is represented by the equation


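$$X_2' = a_{2.1} + b_{2.1}X_1$$

then minimizing Σ(X₂ − X₂′)² = Σ(X₂ − a₂.₁ − b₂.₁X₁)² gives the
following values for a₂.₁ and b₂.₁:

$$a_{2.1} = \bar{X}_2 - b_{2.1}\bar{X}_1 \tag{7}$$

$$b_{2.1} = \frac{\sum X_1X_2 - N\bar{X}_1\bar{X}_2}{\sum X_1^2 - N\bar{X}_1^2} \tag{8}$$

or

$$b_{2.1} = \frac{\sum F\,\dfrac{d_1}{i_1}\dfrac{d_2}{i_2} - N\,\dfrac{c_1}{i_1}\dfrac{c_2}{i_2}}{\sum F\left(\dfrac{d_1}{i_1}\right)^2 - N\left(\dfrac{c_1}{i_1}\right)^2}\cdot\frac{i_2}{i_1} \tag{9}$$

If deviations are measured from the means of X₁ and X₂, then

$$a_{2.1} = 0, \qquad b_{2.1} = \frac{\sum x_1x_2}{\sum x_1^2} \tag{10}$$

For the data of Table 25 the line of regression of X₂ on X₁ is thus
found to be

$$X_2' = -5.548 + 0.9642X_1$$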


This is shown in Fig. 108. 
Interpretation of a Line of Regression. A line of regression
of one variable on another is to be interpreted as indicating the
values of the first (the dependent variable) that would be
obtained for various values of the second (the independent vari-
able) if no other forces were affecting the dependent variable.
If knowledge of the independent variable is all that is to be had,
then the line of regression gives the best estimate that may be
made of the dependent variable.

The regression statistic a (that is, a₁.₂ or a₂.₁) gives the value
of the dependent variable when the independent variable is zero.
It is of only arbitrary significance, since its value is affected by
the origin selected for measuring the independent variable as
well as the units of measurement. The regression statistic b
(that is, b₁.₂ or b₂.₁) is independent of the origin selected and indi-
cates the change that would occur in the dependent variable per
unit change in the independent variable. In the line of regression
of X₁ on X₂, for example, when X₂ increases by one unit, X₁
increases or decreases by b₁.₂ units depending on the sign of b₁.₂.
The value of b₁.₂ will not be affected by proportional changes in
the units of X₁ and X₂. Similar statements hold for b₂.₁ in the
case of the line of regression of X₂ on X₁.
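
These statements about a and b can be checked numerically. The following Python sketch uses hypothetical data and an illustrative helper `regression_line` (both are assumptions, not part of the text). It refits the line after shifting the origin of the independent variable, after multiplying both variables by the same factor, and after rescaling the independent variable alone.

```python
def regression_line(dep, ind):
    """Least-squares line dep' = a + b*ind, computed from deviations about the means."""
    n = len(dep)
    mean_dep = sum(dep) / n
    mean_ind = sum(ind) / n
    b = (sum((u - mean_dep) * (v - mean_ind) for u, v in zip(dep, ind))
         / sum((v - mean_ind) ** 2 for v in ind))
    return mean_dep - b * mean_ind, b


# Hypothetical observations.
x2 = [1.0, 2.0, 3.0, 4.0, 5.0]
x1 = [2.0, 4.0, 5.0, 4.0, 5.0]

a_base, b_base = regression_line(x1, x2)

# New origin for the independent variable: b is unchanged, a is not.
a_shift, b_shift = regression_line(x1, [v + 100.0 for v in x2])

# Proportional change in the units of both variables: b is unchanged.
a_scale, b_scale = regression_line([10.0 * u for u in x1], [10.0 * v for v in x2])

# Change in the units of the independent variable only: b does change.
a_one, b_one = regression_line(x1, [10.0 * v for v in x2])

print(a_base, b_base)
print(a_shift, b_shift)
print(a_scale, b_scale)
print(a_one, b_one)
```

The last case shows why the text speaks of proportional changes in both units: a change in the unit of one variable alone alters b.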

Standard Deviation about Means or Line of Regression. If the 
progressions of the means or the lines of regression are used to 
measure the average relationship between two variables, some 
additional measure is desirable to determine the degree of repre- 
sentativeness of these measures. In the case of a monovariate 
distribution, it will be recalled, the representativeness of the 
mean depended upon how closely the cases were scattered around 
this mean value. This dispersion was measured by the standard 
deviation or some other measure. Similarly, in the present 
instance, the representativeness of the means of X₁, say, for
various values of X₂ will be shown by the dispersion of the cases
around each mean. The standard deviation of the cases in each
column around the mean of that column may thus be taken to
show how well the mean represents the cases in the column.
The same is true for any row.

In Table 27 are given the standard deviations of the columns 
of Table 25. The zero values refer to the columns in which 
there is only one case. The other values center around 16, their 
average being 16.9. 
It is to be noted that the average standard deviation from the
means of the columns, as well as the individual standard devia-
tions from which this average is calculated, are considerably
less than the total standard deviation of X₁, namely, σ₁ = 43.9.
The column means are thus much more representative of the
column values of X₁ than the grand mean is of all the X₁'s.


TABLE 27.—STANDARD DEVIATIONS FOR COLUMNS OF TABLE 25

    Column     N     Sum of squared deviations     Variance     Standard
                     from the column mean             σ²        deviation σ
     (1)       2              0                        0            0
     (2)       0
     (3)       1              0                        0            0
     (4)       7          8,342.86                 1,191.8        34.5
     (5)       5            480.00                    96.0         9.8
     (6)       8          1,400.00                   175.0        13.2
     (7)       9          4,800.00                   533.3        23.1
     (8)      13          2,830.77                   217.8        14.8
     (9)      18          6,577.78                   365.4        19.1
    (10)      11          3,200.00                   290.9        17.1
    (11)       5            320.00                    64.0         8.0
    (12)       2              0                        0            0
              ──
              81


This may be explained by the fact that much of the total varia-
tion in X₁ is due to the variation from column to column, a
variation that is presumably due to association with X₂. When
this variation is eliminated, the remaining variation is consider-
ably reduced. A similar analysis would show the same results
with respect to variation around the means of the rows.

If the association between X₁ and X₂ is measured by a straight
line, the representativeness of this line may be measured by the
dispersion of cases around it. Such a measure would be the
standard deviation of the deviations from the line. The stand-
ard deviation of the vertical deviations from the line of regression
of X₁ on X₂ will measure the representativeness of that line, and
the standard deviation of the horizontal deviations from the line
of regression of X₂ on X₁ will measure its representativeness.
In either case, σ² equals the sum of the squared deviations from
the line divided by N. If the line is fitted by the method of
least squares, the sum of the squared deviations from the line
may be computed from the equations¹
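$$N\sigma_{1.2}^2 = \sum d_{1.2}^2 = \sum X_1^2 - a_{1.2}\sum X_1 - b_{1.2}\sum X_1X_2 \tag{11}$$

and

$$N\sigma_{2.1}^2 = \sum d_{2.1}^2 = \sum X_2^2 - a_{2.1}\sum X_2 - b_{2.1}\sum X_1X_2 \tag{12}$$

or

$$\sigma_{1.2}^2 = \frac{\sum d_{1.2}^2}{N} \qquad\text{and}\qquad \sigma_{2.1}^2 = \frac{\sum d_{2.1}^2}{N}$$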


These standard deviations from the lines of regression will always 
be less than the total standard deviation, because the variation 
represented by the line of regression has been eliminated by 
taking deviations from the line. 
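
The shortcut of Equation (11) can be compared with a direct computation in a short Python sketch. The data are hypothetical and the variable names are illustrative assumptions. The sketch fits the line of regression of X₁ on X₂, sums the squared vertical deviations both ways, and compares the resulting first-order standard deviation with the total standard deviation of X₁.

```python
from math import sqrt

# Hypothetical observations.
x2 = [1.0, 2.0, 3.0, 4.0, 5.0]
x1 = [2.0, 4.0, 5.0, 4.0, 5.0]
n = len(x1)

mean1 = sum(x1) / n
mean2 = sum(x2) / n
sum_products = sum(u * v for u, v in zip(x1, x2))

# Least-squares constants, as in Equations (3) and (4).
b = (sum_products - n * mean1 * mean2) / (sum(v * v for v in x2) - n * mean2 ** 2)
a = mean1 - b * mean2

# Sum of squared vertical deviations, computed directly from the line ...
sum_sq_direct = sum((u - a - b * v) ** 2 for u, v in zip(x1, x2))
# ... and by the shortcut of Equation (11).
sum_sq_eq11 = sum(u * u for u in x1) - a * sum(x1) - b * sum_products

sigma_1_2 = sqrt(sum_sq_eq11 / n)                         # deviation about the line
sigma_1 = sqrt(sum((u - mean1) ** 2 for u in x1) / n)     # total deviation of X1

print(round(sum_sq_direct, 6), round(sum_sq_eq11, 6))
print(round(sigma_1_2, 4), round(sigma_1, 4))
```

Because Equation (11) subtracts large, nearly equal sums, rounding in a and b can matter more there than in the direct computation; carrying extra decimal places in the constants is a reasonable precaution.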

The average standard deviations around the means of columns
or rows and the standard deviations around the lines of regression
may be called “first-order standard deviations,” in contrast to
the total standard deviations, which may be called “zero-order
standard deviations.” Sometimes the first-order standard
deviations are called “standard errors of estimate” since they
indicate the error involved in using a column or row mean or a
line of regression as an estimate of the dependent variable.

If the association between X₁ and X₂, say, is assumed to be
measured by the means of X₁ for given values of X₂ or by the
line of regression of X₁ on X₂, then the smallness of the first-
order standard deviations relative to the zero-order standard
deviations will give some measure of the degree of representative-
ness of these measures of association. As will be seen in the next
section, this measure of the degree of representativeness of a line
of regression is closely related to the so-called “Pearsonian
coefficient of correlation.” As a measure of the degree of repre-
sentativeness of a progression of means, it is closely related to the
“correlation ratio,” which is discussed in Chap. XV, Nonlinear
Correlation.

1 The proof of this is as follows:

$$\sum d_{1.2}^2 = \sum d_{1.2}(X_1 - a_{1.2} - b_{1.2}X_2) = \sum d_{1.2}X_1 - a_{1.2}\sum d_{1.2} - b_{1.2}\sum d_{1.2}X_2$$

But by the least-squares equations (1) and (2) Σd₁.₂ = 0 and Σd₁.₂X₂ = 0.
Hence

$$\sum d_{1.2}^2 = \sum d_{1.2}X_1 = \sum X_1^2 - a_{1.2}\sum X_1 - b_{1.2}\sum X_1X_2$$

The Pearsonian Coefficient of Correlation. The progressions
of means and lines of regression described above were concerned
with describing the “law of relationship” between the two
variables. They gave the average value of one variable associ-
ated with given values of the other variable and showed how
these average values tended to change in unison with the other
variable. In this section, a measure of the degree of association
between the two variables will be described. This measure is
known as the Pearsonian coefficient of correlation after the man
who devised it.
Fig. 109.—A bivariate scatter diagram showing the joint variation in imports
into and exports from the United States. [United States Department of Commerce,
Monthly Summary of Foreign and Domestic Commerce of the United States, Vol. 20
(March, 1940), p. 37; Survey of Current Business, Vol. 21 (March, 1941), p. 37;
Vol. 22 (March, 1942), pp. 5-19.]
[Figure: exports (X₁) on the vertical scale and imports on the horizontal scale, both in billions of dollars.]


The coefficient of correlation suggested by Karl Pearson in
1890 is

$$r_{12} = \frac{\sum x_1x_2}{N\sigma_1\sigma_2} \tag{13}$$


In this equation, x₁ and x₂ refer to deviations from the mean
and N to the number of pairs of cases. For the sake of simplicity, 
this coefficient will now be explained by reference to a bivariate 
distribution in which the cases are not grouped into class intervals. 
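
Equation (13) translates directly into a few lines of Python. The function name `pearson_r` and the small data sets are hypothetical, chosen only to illustrate the arithmetic. The standard deviations are computed with N in the divisor, as elsewhere in this chapter.

```python
from math import sqrt


def pearson_r(x1_values, x2_values):
    """Coefficient of correlation: sum of product deviations over N*sigma1*sigma2."""
    n = len(x1_values)
    mean1 = sum(x1_values) / n
    mean2 = sum(x2_values) / n
    dev1 = [u - mean1 for u in x1_values]   # deviations x1 from the mean of X1
    dev2 = [v - mean2 for v in x2_values]   # deviations x2 from the mean of X2
    sigma1 = sqrt(sum(d * d for d in dev1) / n)
    sigma2 = sqrt(sum(d * d for d in dev2) / n)
    return sum(p * q for p, q in zip(dev1, dev2)) / (n * sigma1 * sigma2)


# Hypothetical ungrouped observations.
x2 = [1.0, 2.0, 3.0, 4.0, 5.0]
x1 = [2.0, 4.0, 5.0, 4.0, 5.0]

print(round(pearson_r(x1, x2), 4))
print(round(pearson_r([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]), 4))   # points on a rising line
print(round(pearson_r([1.0, 2.0, 3.0], [6.0, 4.0, 2.0]), 4))   # points on a falling line
```

The function assumes that neither variable is constant; if either standard deviation were zero, the division would fail.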

Arithmetic View of r. Table 28 and Fig. 109 show the joint
variation of two variables. They indicate that the large values
of X₁, representing total exports from the United States, 1932-
1941, are associated, for the most part, with the large values of
X₂, which represent total imports for consumption into the
United States during the same period of years.

The average of X₁, designated X̄₁, is found to be 2.89; and the
average of X₂, designated X̄₂, is 2.19. The deviations of each

TABLE 28.—EXPORTS AND IMPORTS OF MERCHANDISE, UNITED STATES,
1932-1941
(In billions of dollars)

            Total       Total      Deviations from      Product deviations
           exports     imports       respective X̄             x₁x₂
    Year     X₁          X₂          x₁        x₂               (5)
             (1)         (2)         (3)       (4)          +          −
    1932     1.6         1.3       −1.29     −0.89       1.1481
    1933     1.7         1.5       −1.19     −0.69       0.8211
    1934     2.1         1.6       −0.79     −0.59       0.4661
    1935     2.3         2.0       −0.59     −0.19       0.1121
    1936     2.5         2.4       −0.39      0.21                  −0.0819
    1937     3.3         3.0        0.41      0.81       0.3321
    1938     3.1         1.9        0.21     −0.29                  −0.0609
    1939     3.2         2.3        0.31      0.11       0.0341
    1940     4.0         2.6        1.11      0.41       0.4551
    1941     5.1         3.3        2.21      1.11       2.4531
    Σ =     28.9        21.9        0         0          5.8218     −0.1428
        X̄₁ = 2.89   X̄₂ = 2.19                           Σx₁x₂ = 5.6790



variable from its respective mean are calculated and entered in
the third and fourth columns of the table. The products of x₁
and x₂, the product deviations, are calculated, and the results
entered in the appropriate division of the last column. The sum
of column (5), that is, Σx₁x₂ (the sum of the product deviations),
is 5.679.
In Fig. 109, an X₁ and an X₂ scale are set up in such a way
as to accommodate the range of these variables as shown in
columns (1) and (2) of Table 28. Lines perpendicular to the
respective scales at the points X₁ = 2.89 and X₂ = 2.19 are
drawn so that the figure is divided into four quadrants, quadrant
I containing values of X₁ and X₂ that are both higher than
average (hence both x₁ and x₂ are positive); quadrant II contain-
ing values of X₂ that are smaller than average and values of X₁
that are larger than average (hence x₂ is negative and x₁ is posi-
tive); quadrant III containing values of X₁ and X₂ that are both
smaller than average (hence both x₁ and x₂ are negative); and
quadrant IV containing values of X₂ that are larger than average
and values of X₁ that are smaller than average (hence x₂ is
positive and x₁ is negative). The origin of the coordinates x₁,
x₂ is at the intersection of the perpendicular lines at the X̄₁
and X̄₂ of the scales. For example, measured from the original
origin, the point P has coordinates X₁ = 3.3, X₂ = 3.0; but
measured from the intersection of the means the coordinates of
point P are x₁ = 0.41, x₂ = 0.81 [see columns (1), (2), (3), and
(4) for 1937, Table 28]. It should be noted that only one point
is plotted in the fourth quadrant; this is the 1936 pair of variables
from Table 28. The 1938 pair of variables from Table 28
appears as the sole point in the second quadrant. These two
pairs of variables, 1936 and 1938, are the only ones in the set
that have negative product deviations. The rest of the pairs of
observations appear either in the first or third quadrant because
their product deviations are positive quantities.
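
The arithmetic of Table 28 and the quadrant assignments of Fig. 109 can be reproduced with the short Python sketch below. The figures are those of columns (1) and (2) of Table 28; the helper `quadrant` and the variable names are illustrative assumptions, and the quadrant numbering follows the description in the preceding paragraph (X₁ on the vertical scale, X₂ on the horizontal).

```python
years = list(range(1932, 1942))
exports = [1.6, 1.7, 2.1, 2.3, 2.5, 3.3, 3.1, 3.2, 4.0, 5.1]   # X1, billions of dollars
imports = [1.3, 1.5, 1.6, 2.0, 2.4, 3.0, 1.9, 2.3, 2.6, 3.3]   # X2, billions of dollars

n = len(years)
mean1 = sum(exports) / n
mean2 = sum(imports) / n


def quadrant(d1, d2):
    """Quadrant of a point from the signs of its deviations x1 (vertical) and x2 (horizontal)."""
    if d1 > 0 and d2 > 0:
        return "I"
    if d1 > 0 and d2 < 0:
        return "II"
    if d1 < 0 and d2 < 0:
        return "III"
    if d1 < 0 and d2 > 0:
        return "IV"
    return "on an axis"


products = {}
quadrants = {}
for year, big_x1, big_x2 in zip(years, exports, imports):
    d1 = big_x1 - mean1            # column (3)
    d2 = big_x2 - mean2            # column (4)
    products[year] = d1 * d2       # column (5)
    quadrants[year] = quadrant(d1, d2)

total = sum(products.values())
positive = sum(p for p in products.values() if p > 0)
negative = sum(p for p in products.values() if p < 0)

for year in years:
    print(year, quadrants[year], round(products[year], 4))
print(round(positive, 4), round(negative, 4), round(total, 4))
```

Only the points falling in quadrants II and IV contribute negative product deviations, which is why the sign of the total reflects where the bulk of the points lie.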

If the fluctuations of two variables are so associated that their
plottings appear predominantly in quadrants I and III, the
Σx₁x₂ will be positive. This will be so when larger than average
values of X₁ are associated with larger than average values of
X₂ (quadrant I) and smaller than average values of X₁ are
associated with smaller than average values of X₂ (quadrant III).
Also, if the two variables are so associated that their plottings
appear predominantly in quadrants II and IV, the sum of the
product deviations will be negative. This will be so when smaller
than average values of X₂ are associated with larger than average
values of X₁ (quadrant II) and when larger than average values
of X₂ are associated with smaller than average values of X₁
(quadrant IV). Furthermore, if the plottings are equally
distributed throughout the four quadrants, the sum of the
product deviations will approach zero because of the canceling
of plus and minus product deviations. This will be so when
there is no tendency for association of the variables in any manner,
that is, when smaller than average values of X₁ are associated
about as often with larger values of X₂ as with smaller values of
X₂, etc.
A similar procedure is followed in Table 29 and Fig. 110, in
which X₁ is the price of United States government bonds and
X₂ is the yield on such bonds. Casual inspection of the data
reveals that, when the price of bonds is high, yield is low, and
vice versa.
TABLE 29.—PRICES AND YIELDS ON UNITED STATES GOVERNMENT BONDS,
1932-1941
Averages on bonds outstanding due or callable after 12 years

            Average       Average      Deviations from      Product deviations
             price         yield,        respective X̄             x₁x₂
          ($100 par)      percent
    Year      X₁             X₂          x₁         x₂              (5)
              (1)            (2)         (3)        (4)         +          −
    1932      98.8          3.68       −5.62      0.949                −5.333
    1933     102.3          3.31       −2.12      0.579                −1.227
    1934     104.6          3.12        0.18      0.389
    1935     105.5          2.79        1.08      0.059
    1936     103.7          2.65       −0.72     −0.081
    1937     101.7          2.68       −2.72     −0.051
    1938     103.4          2.56       −1.02     −0.171
    1939     106.0          2.36        1.58     −0.371                −0.586
    1940     107.2          2.21        2.78     −0.521                −1.448
    1941     111.0          1.95        6.58     −0.781                −5.139
    Σ =    1,044.2         27.31        0          0          1.030   −13.733
        X̄₁ = 104.42    X̄₂ = 2.731                          or net
                                                            Σx₁x₂ = −12.703



The sum of the product deviations in Table 29 is a negative 
amount, namely, —12.703. Comparison of Figs. 109 and 110 
will at once bring out the contrast in the location of pairs of 
plotted points. Whereas in Fig. 109 the points are mainly in 
quadrants I and III, the points in Fig. 110 appear principally in
quadrants II and IV.

Again, the same procedure is followed in Table 30 and Fig. 111,
in which X₁ is the height of Princeton freshmen and X₂ is the
grade of these freshmen in their examination in economics.

In Table 30 the negative and positive product deviations so 
nearly offset each other that the sum of product deviations is 
only 1.33. The tendency for the scatter of points throughout
all four quadrants is depicted in Fig. 111 on page 345. 

Fig. 110.—A bivariate scatter diagram showing the joint variation in the
price and yield of United States government bonds. [Federal Reserve Bulletin,
December, 1938, p. 1045; July, 1940, pp. 701-702, and Survey of Current Business,
Vol. 21 (March, 1941), p. 36; Vol. 22 (March, 1942), p. 18.]
[Figure: average price on the vertical scale and yield in per cent on the horizontal scale, with the mean X̄₁ = 104.42 marked.]

These three arithmetic illustrations appear to show that
the sum of the product deviations from the arithmetic means
of variables, Σx₁x₂, can be used to measure the extent to which
the variables are associated or related. Following are the reasons
for this:

1. When smaller than average values of $X_1$ are associated
with smaller than average values of $X_2$, the $x_1x_2$ products, being
products of $-x_1$ and $-x_2$, are positive, as shown in Tables 28 to 30.

2. When larger than average values of $X_1$ are associated with
larger than average values of $X_2$, the $x_1x_2$ products, being
products of $+x_1$ and $+x_2$, are also positive, as shown in Tables 28 to 30.

3. On the other hand, when smaller than average values of
$X_1$ are associated with larger than average values of $X_2$, the
$x_1x_2$ products, being products of $-x_1$ and $+x_2$, are negative, as
shown for 1932 and 1933 in Table 29.

4. When larger than average values of $X_1$ are associated with
smaller than average values of $X_2$, the $x_1x_2$ products, being
products of $+x_1$ and $-x_2$, are also negative, as shown for 1939,
1940, and 1941 in Table 29.


TABLE 30.—HEIGHTS OF FRESHMEN, PRINCETON CLASS OF 1941, AND
GRADES ON EXAMINATION IN ECONOMICS

Heights of    Grades of       Deviations from       Product deviations
freshmen,     freshmen,       respective means           $x_5x_6$
in.           percentage
              of 100
$X_5$         $X_6$           $x_5$      $x_6$         +           -
(1)           (2)             (3)        (4)               (5)
66.00         70             -3.96      +3.8                     15.048
69.00         67             -0.96      +0.8                      0.768
70.50         66             +0.54      -0.2                      0.108
69.50         85             -0.46     +18.8                      8.648
68.00         55             -1.96     -11.2        21.952
70.50         60             +0.54      -6.2                      3.348
70.50         67             +0.54      +0.8         0.432
71.60         81             +1.64     +14.8        24.272
73.25         66             +3.29      -0.2                      0.658
70.75         45             +0.79     -21.2                     16.748
$\sum$ = 699.60  662          0          0         +46.656      -45.326
$\bar{X}_5 = 69.96$   $\bar{X}_6 = 66.2$           or net $\sum x_5x_6 = +1.330$


5. When no consistent association prevails between the pairs
of variables observed, the positive $x_1x_2$ products will balance or
very nearly balance the negative $x_1x_2$ products, as shown in Table 30.
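
The balancing described in item 5 can be checked directly. The following is a minimal sketch, assuming the ten height and grade pairs of Table 30; the variable names are illustrative and are not part of the original text.

```python
# Heights (in.) and economics grades of the ten freshmen in Table 30.
heights = [66.00, 69.00, 70.50, 69.50, 68.00, 70.50, 70.50, 71.60, 73.25, 70.75]
grades = [70, 67, 66, 85, 55, 60, 67, 81, 66, 45]

n = len(heights)
mean_height = sum(heights) / n
mean_grade = sum(grades) / n

# One product deviation per pair: (deviation of X5) * (deviation of X6).
products = [(h - mean_height) * (g - mean_grade) for h, g in zip(heights, grades)]

# Split the products by sign, as in the "+" and "-" columns of the table.
positive_sum = sum(p for p in products if p > 0)
negative_sum = sum(p for p in products if p < 0)
net_sum = positive_sum + negative_sum

print(round(positive_sum, 3), round(negative_sum, 3), round(net_sum, 3))
```

The positive and negative totals are each large in size, yet they nearly cancel, which is the pattern the text associates with no consistent relationship.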

The sum of the products of the deviations from the means
indicates correspondence or lack of correspondence of variations
in two sets of variables; but the simple sum of products cannot
be taken as the measure of correlation between the two variables
for the following reasons:

1. The sum of product deviations for one set of paired variables
is not comparable with a similar sum of product deviations
for another set of paired variables. A small sum of product
deviations may result from the fact that a small number of cases
is included, and a large sum of product deviations may indicate
merely that a large number of cases is involved; and yet the actual
degree of correlation might be the same in the two sets. In the
second instance the larger sum of product deviations is due solely
to the fact that it resulted from a larger number of cases. It
seems obvious that an average of the product deviations is
required. Such an average can be obtained by dividing the
sum of product deviations by $N$. Thus the average product
deviation is $\sum x_1x_2/N$.

FIG. 111.—A bivariate scatter diagram showing the joint variation (or lack
of it) between the heights of Princeton freshmen and their grades on an
examination in economics. (Vertical axis: freshmen heights, $X_5$; horizontal
axis: per cent grades, $X_6$.)

2. The product deviations in terms of original units of the
data are without meaning because of nonhomogeneity of units.
Suppose the $X_1$ variable is the price of wheat per bushel, which
would be expressed in dollars and cents; and the $X_2$ variable is
the birth rate. Or, again, suppose the $X_1$ variable is the height
of men expressed in inches and the $X_2$ variable is the weight of
men expressed in pounds. Or suppose the $X_1$ variable is the
marriage rate and the $X_2$ variable is the volume of trade, or
prices, etc. In all such pairs of variables, the product deviations
in terms of original units are meaningless; they are products of
nonhomogeneous things. What meaning can be ascribed to the
product of inches and pounds or to the product of marriage rates
and volume of trade? It is necessary to perceive in the situation
a general common denominator.

The comparable thing being compared is the purely abstract
thing, deviation above or below average; accordingly, the standard
deviation $\sigma$ may be used as a general common denominator.
Whatever the original unit of measurement, if the variable is
normally distributed the standard deviation represents approximately
one-sixth the range of that variable. The standard deviation
is a unit of deviation from the mean measuring a common
characteristic among all variables and is, therefore, a homogeneous
unit among all variables. Consequently, the standard
deviation is used to reduce these product deviations to terms of
comparability with each other. When this is done, the average
product deviation becomes a measure of correlation known as the
Pearsonian coefficient, namely,

$$r_{12} = \frac{\sum \dfrac{x_1}{\sigma_1}\,\dfrac{x_2}{\sigma_2}}{N}$$

Since $\sigma_1$ and $\sigma_2$ are constants in each particular problem, this
equation may be written as follows:

$$r_{12} = \frac{\sum x_1x_2}{N\sigma_1\sigma_2} \qquad (3)$$

This is the usual form in which the formula for the Pearsonian
coefficient of correlation is given. The value of this average
expression fluctuates between the limits $+1$ and $-1$. Any
value greater than $+1$ or less than $-1$ is a mistake, not an error
in the statistical sense. If $r_{12} = +1$, this means perfect positive
correlation (large values of $X_1$ are associated with large values of
$X_2$, and vice versa); if $r_{12} = -1$, this means perfect negative
correlation (large values of $X_1$ are associated with small values of
$X_2$, and vice versa); if $r_{12} = 0$, this means no linear correlation.
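
The point about nonhomogeneous units can be illustrated with a short sketch. It assumes the height and grade pairs of Table 30 and converts the heights from inches to centimeters; the helper names `sum_of_products`, `std_dev`, and `pearson_r` are hypothetical, chosen only for this illustration. The raw sum of product deviations changes with the unit, while the coefficient in standard-deviation units does not.

```python
import math

def sum_of_products(xs, ys):
    """Sum of products of deviations from the two arithmetic means."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

def std_dev(xs):
    """Standard deviation with divisor N, as used throughout the text."""
    mean_x = sum(xs) / len(xs)
    return math.sqrt(sum((x - mean_x) ** 2 for x in xs) / len(xs))

def pearson_r(xs, ys):
    """Average product deviation expressed in standard-deviation units."""
    return sum_of_products(xs, ys) / (len(xs) * std_dev(xs) * std_dev(ys))

heights_in = [66.00, 69.00, 70.50, 69.50, 68.00, 70.50, 70.50, 71.60, 73.25, 70.75]
grades = [70, 67, 66, 85, 55, 60, 67, 81, 66, 45]
heights_cm = [2.54 * h for h in heights_in]  # same heights, different unit

s_in = sum_of_products(heights_in, grades)
s_cm = sum_of_products(heights_cm, grades)
r_in = pearson_r(heights_in, grades)
r_cm = pearson_r(heights_cm, grades)

print(round(s_in, 3), round(s_cm, 3))  # the sums differ by the factor 2.54
print(round(r_in, 4), round(r_cm, 4))  # the coefficients agree
```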

Calculation of the Coefficient of Correlation. The data in
Table 28 may be taken to illustrate the detailed calculation
of the coefficient of correlation, by first expressing all product
deviations in terms of their respective standard-deviation units.

TABLE 31.—CALCULATION OF COEFFICIENT OF CORRELATION BETWEEN
UNITED STATES EXPORTS AND IMPORTS, 1932-1941

Deviations from       Squares of            Deviations in              Product of deviations in
respective means      deviations            standard-deviation units   standard-deviation units
$x_1$     $x_2$       $x_1^2$    $x_2^2$    $x_1/\sigma_1$  $x_2/\sigma_2$       +         -
(1)       (2)         (3)        (4)        (5)       (6)                 (7)
-1.29     -0.89       1.6641     0.7921     -1.251    -1.435            1.795
-1.19     -0.69       1.4161     0.4761     -1.154    -1.112            1.283
-0.79     -0.59       0.6241     0.3481     -0.766    -0.951            0.728
-0.59     -0.19       0.3481     0.0361     -0.572    -0.306            0.175
-0.39      0.21       0.1521     0.0441     -0.378     0.338                      -0.128
 0.41      0.81       0.1681     0.6561      0.398     1.306            0.520
 0.21     -0.29       0.0441     0.0841      0.204    -0.467                      -0.095
 0.31      0.11       0.0961     0.0121      0.301     0.177            0.053
 1.11      0.41       1.2321     0.1681      1.077     0.661            0.712
 2.21      1.11       4.8841     1.2321      2.144     1.789            3.836
$\sum$               10.6290     3.8490                                 9.102     -0.223
                     $\sigma_1 = 1.031$  $\sigma_2 = 0.6204$            or net $\sum \dfrac{x_1}{\sigma_1}\dfrac{x_2}{\sigma_2} = 8.879$

The standard deviations were calculated from the sums of columns (3) and (4).

The Pearsonian coefficient of correlation may now be quickly
calculated from the sum of product deviations in standard-deviation
units [the foot of column (7) of Table 31]. This sum
divided by $N$ is the coefficient of correlation. In other words,

$$r_{12} = \frac{\sum \dfrac{x_1}{\sigma_1}\,\dfrac{x_2}{\sigma_2}}{N} = \frac{8.879}{10} = 0.8879$$

It is not necessary, however, to divide each deviation by its
standard deviation because the two standard deviations are
constants. Table 28 having been constructed, if the standard
deviations are calculated, as in columns (3) and (4), Table 31,
it is then necessary only to use Eq. (3), as follows:

$$r_{12} = \frac{\sum x_1x_2}{N\sigma_1\sigma_2} = \frac{5.6790}{10(1.031)(0.6204)} = \frac{0.5679}{0.6396} = 0.8879$$
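
The equivalence of the two routes can be sketched in a few lines. This is an illustration only, assuming the export and import figures of Table 28 as they are reproduced in the work sheet of Table 32; the variable names are not from the text.

```python
import math

# Exports (X1) and imports (X2), 1932-1941, as reproduced in Table 32.
exports_x1 = [1.6, 1.7, 2.1, 2.3, 2.5, 3.3, 3.1, 3.2, 4.0, 5.1]
imports_x2 = [1.3, 1.5, 1.6, 2.0, 2.4, 3.0, 1.9, 2.3, 2.6, 3.3]

n = len(exports_x1)
mean_1 = sum(exports_x1) / n
mean_2 = sum(imports_x2) / n
x1 = [v - mean_1 for v in exports_x1]  # deviations from the mean
x2 = [v - mean_2 for v in imports_x2]

sigma_1 = math.sqrt(sum(d * d for d in x1) / n)
sigma_2 = math.sqrt(sum(d * d for d in x2) / n)

# Route 1: average the products of deviations in standard-deviation units,
# as in columns (5) to (7) of Table 31.
r_standardized = sum((a / sigma_1) * (b / sigma_2) for a, b in zip(x1, x2)) / n

# Route 2: divide the plain sum of products once by N * sigma_1 * sigma_2.
r_direct = sum(a * b for a, b in zip(x1, x2)) / (n * sigma_1 * sigma_2)

print(round(sigma_1, 3), round(sigma_2, 4))
print(round(r_standardized, 4), round(r_direct, 4))
```

Because the two standard deviations are constants, both routes give the same value up to rounding, which is why the second is preferred for hand calculation.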


Accordingly columns (5) to (7) of Table 31 need not be computed.
For example, to calculate the coefficients of correlation for the
data in Tables 29 and 30, the standard deviations are calculated
and the respective coefficients of correlation are then obtained
as follows:

Correlation between prices and yields on United States
government bonds, 1932-1941:

$$r_{34} = \frac{\sum x_3x_4}{N\sigma_3\sigma_4} = \frac{-12.703}{10(3.16)(0.51)} = \frac{-1.2703}{1.6116} = -0.7882 \qquad (\sigma_3 = 3.16,\ \sigma_4 = 0.51)$$

Correlation between heights and grades of freshmen:

$$r_{56} = \frac{\sum x_5x_6}{N\sigma_5\sigma_6} = \frac{1.33}{10(1.89)(10.96)} = \frac{0.133}{(1.89)(10.96)} = \frac{0.133}{20.7} = 0.0064 \qquad (\sigma_5 = 1.889,\ \sigma_6 = 10.96)$$


For a small number of cases it is possible to calculate a coefficient
of correlation according to the procedure illustrated in the
tables and calculations immediately preceding. For a large
number of pairs of values it is desirable to group the pairs into
class intervals. The value of $X_1$ for each pair then becomes the
mid-point of the interval to which the $X_1$ value belongs; the
value of $X_2$ for the pair will be the mid-point of the interval
to which the $X_2$ value belongs. If more than one pair of cases
belongs to the same $X_1$ and $X_2$ intervals, the frequency of such
pairs is determined. This procedure was illustrated by the
analysis of Tables 24 and 25 in discussing the bivariate frequency
distribution of 81 Mount Holyoke freshmen.¹ When the bivariates
are arranged in a bivariate frequency distribution, $r_{12}$ is
measured by $\sum Fx_1x_2/N\sigma_1\sigma_2$, where $F$ represents the frequency of
pairs of values belonging to the same $X_1$ and $X_2$ intervals. For
example, in Table 25 (for $X_1$ = 160-, $X_2$ = 120-), $F = 5$,
$x_1 = -47.4$, and $x_2 = -74.1$. Accordingly, the $Fx_1x_2$ for this cell
is equal to $5(-47.4)(-74.1) = 17,561.7$. When this procedure
is followed for the entire table, the $\sum Fx_1x_2$ is obtained.
Special methods for calculating $r$ from grouped data are described
in detail in Chap. XIV, in which advantage is taken of certain
short-cut procedures.
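
The contribution of a single cell can be sketched as follows. The sketch assumes a class interval of 20 (the interval used for these grades in Chap. XIV) and the means 217.4 and 204.1 implied by the deviations quoted above; the function names are hypothetical.

```python
# Means of X1 and X2 implied by the deviations quoted in the text.
MEAN_1, MEAN_2 = 217.4, 204.1
CLASS_WIDTH = 20  # assumed class interval for both variables

def midpoint(lower_limit, width=CLASS_WIDTH):
    """Mid-point of the class interval that starts at lower_limit."""
    return lower_limit + width / 2

def cell_contribution(frequency, lower_1, lower_2):
    """F * x1 * x2 for one cell of the bivariate frequency table."""
    x1 = midpoint(lower_1) - MEAN_1  # deviation of the X1 mid-point
    x2 = midpoint(lower_2) - MEAN_2  # deviation of the X2 mid-point
    return frequency * x1 * x2

# The cell X1 = 160-, X2 = 120-, which contains five pairs.
contribution = cell_contribution(5, 160, 120)
print(round(contribution, 1))
```

Summing `cell_contribution` over every occupied cell would give the $\sum Fx_1x_2$ of the text.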

Relationship between Lines of Regression and r. If a line
of regression is fitted by the method of least squares, the values
of $b_{12}$ and $b_{21}$ are given by Eqs. (6) and (10). It will now be shown
that these reduce to formulas involving $r_{12}$. From the definition
of $r_{12} = \sum x_1x_2/N\sigma_1\sigma_2$,

$$\sum x_1x_2 = Nr_{12}\sigma_1\sigma_2$$

Secondly, note that, from the definition of $\sigma_2^2 = \sum x_2^2/N$,

$$\sum x_2^2 = N\sigma_2^2$$

Hence,

$$b_{12} = \frac{\sum x_1x_2}{\sum x_2^2} = \frac{Nr_{12}\sigma_1\sigma_2}{N\sigma_2^2} = r_{12}\frac{\sigma_1}{\sigma_2}$$

In the same manner it can be shown that

$$b_{21} = r_{12}\frac{\sigma_2}{\sigma_1}$$


Hence, if deviations of the variables are measured from their
mean values, the lines of regression may be written (in this case
$a_{12}$ and $a_{21}$ are both zero)

$$x_1' = r_{12}\frac{\sigma_1}{\sigma_2}x_2 \qquad (14)$$

$$x_2' = r_{12}\frac{\sigma_2}{\sigma_1}x_1 \qquad (15)$$

If the first of these equations is divided by $\sigma_1$ and the second by
$\sigma_2$, they become

$$\frac{x_1'}{\sigma_1} = r_{12}\frac{x_2}{\sigma_2}$$

$$\frac{x_2'}{\sigma_2} = r_{12}\frac{x_1}{\sigma_1}$$

¹ See pp. 325-326.


Thus it may be concluded that if the variables are measured in
standard-deviation units, the slopes of the lines of regression are
the Pearsonian coefficient of correlation. In this light, $r_{12}$ is
the change in the average value of $X_1$ expressed in $\sigma_1$ units when
$X_2$ changes by one $\sigma_2$ unit. It is also the change in the average
value of $X_2$ expressed in $\sigma_2$ units when $X_1$ changes by one $\sigma_1$ unit.

FIG. 112.—Diagram showing relationship between lines of regression and the
Pearsonian coefficient of correlation r.
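
The conclusion that both slopes equal $r$ in standard-deviation units can be checked numerically. The following is a sketch under stated assumptions: it uses the export and import figures reproduced in Table 32, and the helper `standardize` is a hypothetical name introduced here.

```python
import math

exports_x1 = [1.6, 1.7, 2.1, 2.3, 2.5, 3.3, 3.1, 3.2, 4.0, 5.1]
imports_x2 = [1.3, 1.5, 1.6, 2.0, 2.4, 3.0, 1.9, 2.3, 2.6, 3.3]

def standardize(values):
    """Express each value as a deviation from the mean in sigma units."""
    n = len(values)
    mean = sum(values) / n
    sigma = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / sigma for v in values]

z1 = standardize(exports_x1)
z2 = standardize(imports_x2)
n = len(z1)

r = sum(a * b for a, b in zip(z1, z2)) / n

# Least-squares slope of a line through the origin: sum(xy) / sum(x^2).
slope_1_on_2 = sum(a * b for a, b in zip(z1, z2)) / sum(b * b for b in z2)
slope_2_on_1 = sum(a * b for a, b in zip(z1, z2)) / sum(a * a for a in z1)

print(round(r, 4), round(slope_1_on_2, 4), round(slope_2_on_1, 4))
```

In original units the two slopes differ (they are $r\sigma_1/\sigma_2$ and $r\sigma_2/\sigma_1$); it is only after standardizing that both collapse to $r$.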

This property of $r$ is shown geometrically in Fig. 112. This
shows that the slope of the regression of $X_1$ on $X_2$ is $r$ with
reference to the $X_2$-axis, and $1/r$ with reference to the $X_1$-axis;
that is to say, the line of regression of $X_1$ on $X_2$ makes with the
$X_2$-axis an angle whose tangent is $r$ and with the $X_1$-axis an angle
whose tangent is $1/r$. The slope of the regression of $X_2$ on $X_1$ is
likewise $r$, but with reference to the $X_1$-axis, and $1/r$ with
reference to the $X_2$-axis; that is to say, the line of regression of
$X_2$ on $X_1$ makes with the $X_1$-axis an angle whose tangent is $r$ and
with the $X_2$-axis an angle whose tangent is $1/r$. In other words, in
Fig. 112, angle $\alpha$ equals angle $\alpha'$, and angle $\beta$ equals
angle $\beta'$. All this is on the assumption
that the variables are expressed in standard-deviation units as
indicated in the equations above.

Thus, in Fig. 112, the tangent of the angle $\alpha$ is $r$, and that of
the angle $\beta$ is $1/r$. When $|\alpha| \le \pi/4$, $-1 \le \tan\alpha \le 1$.
Geometrically, within the limits $|\alpha| \le \pi/4$, $\tan\alpha$ varies between $+1$
and $-1$, passing through zero, and $\tan\beta$ between $+1$ and $-1$,
passing through infinity. The two lines of regression merge into
one line when $r = 1$ (for $\tan\alpha = 1$ when the angle is a 45-degree
angle).

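Relationship between r and the First-order Standard Deviation.
It will be recalled that the standard deviation of the
deviations from the line of regression of $X_1$ on $X_2$ is given by

$$N\sigma_{1.2}^2 = \sum X_1^2 - a_{12}\sum X_1 - b_{12}\sum X_1X_2$$

If the variables are measured from their mean values, this
becomes

$$N\sigma_{1.2}^2 = \sum x_1^2 - b_{12}\sum x_1x_2$$

But

$$\sum x_1^2 = N\sigma_1^2, \qquad b_{12} = r_{12}\frac{\sigma_1}{\sigma_2}, \qquad \text{and} \qquad \sum x_1x_2 = Nr_{12}\sigma_1\sigma_2$$

Hence,

$$N\sigma_{1.2}^2 = N\sigma_1^2 - N\sigma_1^2r_{12}^2$$

and

$$\sigma_{1.2}^2 = \sigma_1^2(1 - r_{12}^2)$$

Finally,

$$\sigma_{1.2} = \sigma_1\sqrt{1 - r_{12}^2} \qquad (16)$$

In the same manner,

$$\sigma_{2.1} = \sigma_2\sqrt{1 - r_{12}^2} \qquad (17)$$

These formulas may also be put in the form

$$r_{12}^2 = 1 - \frac{\sigma_{1.2}^2}{\sigma_1^2} \qquad (18)$$

$$r_{12}^2 = 1 - \frac{\sigma_{2.1}^2}{\sigma_2^2} \qquad (19)$$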
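
Equation (16) can be verified by computing the scatter about the line of regression directly. This is a rough sketch, assuming the export and import figures reproduced in Table 32 and the least-squares slope $b_{12} = \sum x_1x_2/\sum x_2^2$; the variable names are illustrative.

```python
import math

exports_x1 = [1.6, 1.7, 2.1, 2.3, 2.5, 3.3, 3.1, 3.2, 4.0, 5.1]
imports_x2 = [1.3, 1.5, 1.6, 2.0, 2.4, 3.0, 1.9, 2.3, 2.6, 3.3]

n = len(exports_x1)
mean_1 = sum(exports_x1) / n
mean_2 = sum(imports_x2) / n
x1 = [v - mean_1 for v in exports_x1]
x2 = [v - mean_2 for v in imports_x2]

sum_x1x2 = sum(a * b for a, b in zip(x1, x2))
sum_x1_sq = sum(a * a for a in x1)
sum_x2_sq = sum(b * b for b in x2)

sigma_1 = math.sqrt(sum_x1_sq / n)
r = sum_x1x2 / math.sqrt(sum_x1_sq * sum_x2_sq)

# Direct route: deviations of each x1 from the regression line x1' = b12 * x2.
b12 = sum_x1x2 / sum_x2_sq
residuals = [a - b12 * b for a, b in zip(x1, x2)]
sigma_1_2_direct = math.sqrt(sum(e * e for e in residuals) / n)

# Shortcut of Eq. (16): no residuals are needed once r is known.
sigma_1_2_formula = sigma_1 * math.sqrt(1 - r * r)

print(round(sigma_1_2_direct, 4), round(sigma_1_2_formula, 4))
```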


It will thus be seen that $r$ is closely related to the scatter
about the lines of regression. If this scatter is a small percentage
of the total scatter, indicating a high degree of representativeness
of a line of regression, then $r$ is high. If the scatter is a large
percentage of the total variation in the dependent variable,
indicating a low degree of representativeness of a line of regression,
then the value of $r$ is small. In other words, the better
a line of regression fits the data, the higher the value of $r$, and
vice versa. The Pearsonian coefficient of correlation is thus a
measure of the goodness of fit of the lines of regression.

The Pearsonian Coefficient of Correlation and the Analysis of
Variance. For every point on a bivariate scatter diagram such
as Fig. 107, there is a corresponding point on the line of regression
of $X_1$ on $X_2$. Geometrically this is obtained by projecting the
point vertically onto the line of regression (see Fig. 107). Algebraically,
the $x_1'$ coordinate of a point on the line of regression is
found by substituting the given value of $x_2$ in the regression
equation $x_1' = r_{12}\dfrac{\sigma_1}{\sigma_2}x_2$.

When the variables are measured from their mean values, the
mean of the various values of $x_2$ is zero. Hence the mean of
the corresponding values of $x_1'$ is zero also. The standard
deviation of these $x_1'$ values is thus given by

$$\sigma_{x_1'}^2 = \frac{\sum x_1'^2}{N} = r_{12}^2\frac{\sigma_1^2}{\sigma_2^2}\cdot\frac{\sum x_2^2}{N} = r_{12}^2\sigma_1^2$$

Equation (16) may thus be written

$$\sigma_{1.2}^2 = \sigma_1^2 - \sigma_{x_1'}^2$$

or

$$\sigma_1^2 = \sigma_{x_1'}^2 + \sigma_{1.2}^2 \qquad (20)$$


This says that the total variance of the $x_1$ values is equal to the
variance of the corresponding points on the line of regression
plus the variance of the deviations from these points. Another
way of looking at this is to regard the total variation in $X_1$ as
made up of two parts, one consisting of the variation ($\sigma_{x_1'}^2$) due
to its association with $X_2$ as represented by the line of regression,
the other representing the variation in $X_1$ due to its association
with factors independent of $X_2$ (that is, $\sigma_{1.2}^2$).
Similar analysis shows that

$$\sigma_2^2 = \sigma_{x_2'}^2 + \sigma_{2.1}^2 \qquad (21)$$

in other words, that the total variance in $X_2$ is made up of a part
($\sigma_{x_2'}^2$) due to its association with $X_1$ as represented by the line of
regression of $X_2$ on $X_1$ and a part ($\sigma_{2.1}^2$) due to its association
with factors independent of $X_1$ as measured by the deviations
from this line of regression.

The formula $\sigma_{x_1'}^2 = r_{12}^2\sigma_1^2$, which may also be written
$r_{12}^2 = \sigma_{x_1'}^2/\sigma_1^2$, sheds further light on the meaning of $r$.
It shows that $r_{12}^2$ measures the proportion of the total variance in
$X_1$ that is due to its association with $X_2$. It also measures the
proportion of the total variance in $X_2$ that is due to its association
with $X_1$.
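
As a numerical illustration of Eq. (20), the following sketch uses the rounded export and import figures computed earlier in the chapter ($\sigma_1 = 1.031$, $r_{12} = 0.8879$). Because the inputs are rounded, the last digit of each result is approximate.

$$\sigma_{x_1'}^2 = r_{12}^2\sigma_1^2 = (0.8879)^2(1.031)^2 \approx (0.7884)(1.063) \approx 0.838$$

$$\sigma_{1.2}^2 = (1 - r_{12}^2)\sigma_1^2 \approx (0.2116)(1.063) \approx 0.225$$

$$\sigma_{x_1'}^2 + \sigma_{1.2}^2 \approx 0.838 + 0.225 = 1.063 \approx \sigma_1^2$$

On these figures roughly 79 percent of the variance in exports is associated with imports, and the remaining 21 percent is scatter about the line of regression.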


CHAPTER XIV 


COMPUTATION OF r AND OTHER MEASURES 
OF CORRELATION 


The previous chapter was concerned with an explanation 
of the various devices used to measure the association between 
two variables. This chapter will illustrate their use by carrying 
out a numerical analysis. Only linear correlation will be con- 
sidered here. Measures of nonlinear correlation will be discussed 
in Chap. XV. 

The order of analysis will be first to calculate the correlation 
coefficient. This will be done for both ungrouped and grouped 
data, and use will be made of short-cut methods of calculation. 
For the grouped data, lines of regression will be computed, and 
first-order variances and standard deviations. Reference will 
again be made to the progressions of means, but the analysis will 
be continued no further than in the previous chapter. 

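Computation of r from Ungrouped Data. Since $x = X - \bar{X}$,
it follows that

$$\sum x_1x_2 = \sum (X_1 - \bar{X}_1)(X_2 - \bar{X}_2) = \sum X_1X_2 - N\bar{X}_1\bar{X}_2$$

Likewise, $\sigma_1$ and $\sigma_2$ are equal to

$$\sqrt{\frac{\sum X_1^2 - N\bar{X}_1^2}{N}} \qquad \text{and} \qquad \sqrt{\frac{\sum X_2^2 - N\bar{X}_2^2}{N}}$$

Hence the correlation coefficient can be computed from the
equation

$$r_{12} = \frac{\sum X_1X_2 - N\bar{X}_1\bar{X}_2}{\sqrt{\sum X_1^2 - N\bar{X}_1^2}\,\sqrt{\sum X_2^2 - N\bar{X}_2^2}} \qquad (1)$$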
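
Equation (1) needs only five column totals and $N$. A minimal sketch follows, assuming the totals shown at the foot of the work sheet in Table 32; the function name `r_from_raw_sums` and its parameter names are hypothetical.

```python
import math

def r_from_raw_sums(n, sum_x1, sum_x2, sum_x1x2, sum_x1_sq, sum_x2_sq):
    """Pearsonian r from raw totals, following Eq. (1)."""
    mean_1 = sum_x1 / n
    mean_2 = sum_x2 / n
    numerator = sum_x1x2 - n * mean_1 * mean_2
    denominator = (math.sqrt(sum_x1_sq - n * mean_1 ** 2)
                   * math.sqrt(sum_x2_sq - n * mean_2 ** 2))
    return numerator / denominator

# Totals of columns (1) to (5) of Table 32 (exports and imports, N = 10).
r = r_from_raw_sums(n=10, sum_x1=28.9, sum_x2=21.9,
                    sum_x1x2=68.97, sum_x1_sq=94.15, sum_x2_sq=51.81)
print(round(r, 3))
```

No deviations from the means are formed at any stage, which is the saving the formula is meant to provide.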

To illustrate the use of this formula consider again the data
on exports and imports of Table 28.¹ These are reproduced
in Table 32, together with the calculations of $\sum X_1X_2$, $\bar{X}_1$, $\bar{X}_2$,
$\sum X_1^2$, and $\sum X_2^2$. Three check columns are also employed. The
checks are column (1) + column (2) = column (6);
column (3) + column (4) = column (7);
and column (3) + column (5) = column (8).

¹ See p. 340.

In the preliminary calculations of Table 32, the last check
failed. This showed that a mistake had been made in either
column (5) or column (8), for column (3) checked with columns
(4) and (7). After some investigation the mistake was found in
column (8). By dividing the checks up in this way, an error
can be easily located. This sort of check is called a "Charlier
check."
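
The three checks can be sketched as follows. The column layout mirrors Table 32; the tolerance and the helper `nearly_equal` are assumptions introduced here to allow for floating-point rounding.

```python
x1 = [1.6, 1.7, 2.1, 2.3, 2.5, 3.3, 3.1, 3.2, 4.0, 5.1]  # column (1)
x2 = [1.3, 1.5, 1.6, 2.0, 2.4, 3.0, 1.9, 2.3, 2.6, 3.3]  # column (2)

col3 = [a * b for a, b in zip(x1, x2)]        # X1 * X2
col4 = [a * a for a in x1]                    # X1 squared
col5 = [b * b for b in x2]                    # X2 squared
col6 = [a + b for a, b in zip(x1, x2)]        # X1 + X2
col7 = [a * (a + b) for a, b in zip(x1, x2)]  # X1 * (X1 + X2)
col8 = [b * (a + b) for a, b in zip(x1, x2)]  # X2 * (X1 + X2)

def nearly_equal(left, right, tolerance=1e-9):
    """Compare two totals, allowing for floating-point rounding."""
    return abs(left - right) <= tolerance

# Each check ties one pair of columns to an independently computed column,
# so a failure points to a small set of columns rather than the whole sheet.
checks = {
    "(1) + (2) = (6)": nearly_equal(sum(x1) + sum(x2), sum(col6)),
    "(3) + (4) = (7)": nearly_equal(sum(col3) + sum(col4), sum(col7)),
    "(3) + (5) = (8)": nearly_equal(sum(col3) + sum(col5), sum(col8)),
}
for name, passed in checks.items():
    print(name, "ok" if passed else "FAILED")
```

If only the third check fails while the second passes, column (3) is cleared and the search narrows to columns (5) and (8), as in the episode described above.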


TABLE 32.—WORK SHEET FOR COMPUTING r FROM UNGROUPED DATA

(1)       (2)       (3)        (4)       (5)       (6)           (7)              (8)
$X_1$     $X_2$     $X_1X_2$   $X_1^2$   $X_2^2$   $X_1 + X_2$   $X_1X_2 + X_1^2$   $X_1X_2 + X_2^2$
1.6       1.3        2.08       2.56      1.69      2.9            4.64             3.77
1.7       1.5        2.55       2.89      2.25      3.2            5.44             4.80
2.1       1.6        3.36       4.41      2.56      3.7            7.77             5.92
2.3       2.0        4.60       5.29      4.00      4.3            9.89             8.60
2.5       2.4        6.00       6.25      5.76      4.9           12.25            11.76
3.3       3.0        9.90      10.89      9.00      6.3           20.79            18.90
3.1       1.9        5.89       9.61      3.61      5.0           15.50             9.50
3.2       2.3        7.36      10.24      5.29      5.5           17.60            12.65
4.0       2.6       10.40      16.00      6.76      6.6           26.40            17.16
5.1       3.3       16.83      26.01     10.89      8.4           42.84            27.72
$\sum$ = 28.9  21.9  68.97     94.15     51.81     50.8          163.12           120.78
$\bar{X}_1 = 2.89$   $\bar{X}_2 = 2.19$

Checks:

$\sum(1) + \sum(2) = \sum(6)$:   28.9 + 21.9 = 50.8
$\sum(3) + \sum(4) = \sum(7)$:   68.97 + 94.15 = 163.12
$\sum(3) + \sum(5) = \sum(8)$:   68.97 + 51.81 = 120.78


From Table 32, $r$ is found according to Eq. (1) to be equal to

$$\begin{aligned}
r_{12} &= \frac{68.97 - 10(2.89)(2.19)}{\sqrt{94.15 - 10(2.89)^2}\,\sqrt{51.81 - 10(2.19)^2}} \\
&= \frac{68.97 - 63.291}{\sqrt{94.15 - 83.521}\,\sqrt{51.81 - 47.961}} \\
&= \frac{5.679}{\sqrt{10.629}\,\sqrt{3.849}} = \frac{5.679}{(3.26)(1.962)} \approx 0.888
\end{aligned}$$

where $\sigma_1 = \sqrt{1.0629}$ and $\sigma_2 = \sqrt{0.3849}$.

TABLE 33.—GRADES IN SECOND- AND FIRST-SEMESTER ENGLISH, 81 FRESHMEN
AT MOUNT HOLYOKE
(A, B, C, and D grades have been converted to a numerical scale)

          Second-    First-               Second-    First-
Student   semester   semester   Student   semester   semester
number    grade      grade      number    grade      grade
          $X_1$      $X_2$                $X_1$      $X_2$
1 240 220 41 260 260
2 200 180 42 180 160 
3 260 240 43 100 60 
4 260 260 44 200 220 
5 160 160 45 200 200 
6 240 220 46 160 120 
7 220 200 47 180 160 
8 60 120 48 280 220 
9 220 240 49 200 200 
10 200 180 50 220 220 
11 220 220 51 220 200 
12 140 180 52 240 220 
13 160 120 53 100 60 
14 240 260 54 220 220 
15 260 240 55 240 200 
16 200 160 56 200 220 
17 200 160 57 220 220 
18 240 240 58 220 200 
19 240 220 59 240 200 
20 240 220 60 180 140 
21 160 140 61 160 140 
22 220 240 62 240 220 
23 200 200 63 260 200 
24 100 100 64 160 120 
25 160 140 65 260 240 
26 200 160 66 220 180 
27 180 180 67 220 240 
28 180 160 68 260 260 
29 240 240 69 240 220 
30 200 200 70 200 200 
31 200 200 71 140 120 
32 180 160 72 260 240 
33 260 220 73 200 180 
34 160 120 74 300 280 
35 240 240 75 180 140 
36 220 220 76 220 180 
37 220 240 77 180 180 
38 160 120 78 300 280 
39 200 200 79 220 220
40 220 220 80 220 200
           81 200 180

The value of $r$ computed from Table 32 agrees to two decimal places
with the previous calculations of this coefficient made in Chap. XIII.
The difference is due to the different ways of rounding off decimals
in making the calculations.

Computation of r from Grouped Data. The Data. The data 
to be used to illustrate the computation of r for grouped data 
are given in Table 33. They may be explained as follows: 

First pair $X_1$, $X_2$. The first pair of observations are the
second-semester and the first-semester English grades, respectively,
of student 1, viz., 240, 220.

Second pair $X_1$, $X_2$. The second pair of observations are the
second-semester and the first-semester English grades, respectively,
of student 2, viz., 200, 180.

Third pair $X_1$, $X_2$. The third pair of observations are the
second-semester and the first-semester English grades, respectively,
of student 3, viz., 260, 240.

The Correlation, or Bivariate Frequency, Table. After the data
have been tabulated as in Table 33, a correlation table, which
is in effect a bivariate frequency distribution, is constructed.
The table is set up with class-interval scales suitable for each
variable,¹ and additional columns and rows are arranged for the
required calculations. In the center of each cell of the correlation
table, frequencies are shown; for example, in the first
column opposite the $X_1$ scale of Table 34, 2 is the frequency of
occurrence of $X_1$ between 100 and 120 and $X_2$ between 60 and 80.
Two students, in other words, have grades in second-semester
English between 100 and 120 and grades in first-semester English
between 60 and 80. When all the frequencies are recorded in
the correlation table, it may be used as a work sheet for the
calculation of the coefficient of correlation.
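
The tallying of pairs into cells can be sketched as follows. This illustration uses only the first ten students of Table 33 and assumes class intervals of width 20 that begin at multiples of 20 (60-, 80-, 100-, and so on); the function name `class_lower_limit` is hypothetical.

```python
from collections import Counter

# (second-semester grade X1, first-semester grade X2) for students 1 to 10.
pairs = [(240, 220), (200, 180), (260, 240), (260, 260), (160, 160),
         (240, 220), (220, 200), (60, 120), (220, 240), (200, 180)]

CLASS_WIDTH = 20  # assumed class interval for both variables

def class_lower_limit(grade, width=CLASS_WIDTH):
    """Lower limit of the class interval containing the grade."""
    return (grade // width) * width

# Each cell of the correlation table is keyed by the two lower limits.
table = Counter((class_lower_limit(x1), class_lower_limit(x2))
                for x1, x2 in pairs)

# Marginal totals give the separate frequency distributions.
x1_distribution = Counter()
x2_distribution = Counter()
for (lower_1, lower_2), frequency in table.items():
    x1_distribution[lower_1] += frequency
    x2_distribution[lower_2] += frequency

for cell, frequency in sorted(table.items()):
    print(cell, frequency)
```

Running the same tally over all 81 pairs would reproduce the cell frequencies of the full correlation table.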

Short Method for Calculating r. As with the standard deviation
and the mean, it is possible to find $r$ by a short method making
use of arbitrary origins.

In the formula for $r$,

$$r_{12} = \frac{\sum x_1x_2}{N\sigma_1\sigma_2} \qquad (2)$$

$\sigma_1$ and $\sigma_2$ may be calculated by the short method that has already
been presented.² It remains only to evaluate $\sum x_1x_2$ in terms
of deviations from the arbitrary origins, $A_1$ and $A_2$, selected
for the respective variables and in terms of the two correction
factors $C_1$ and $C_2$.

¹ On the question of proper selection of class intervals, see pp. 199-206.
² See pp. 214-215.

To do this, note that¹

$$x_1 = d_1 - C_1 \qquad \text{where} \qquad C_1 = \frac{\sum Fd_1}{N}$$

and

$$x_2 = d_2 - C_2 \qquad \text{where} \qquad C_2 = \frac{\sum Fd_2}{N}$$

Therefore,

$$\sum Fx_1x_2 = \sum F(d_1 - C_1)(d_2 - C_2) \qquad (3)$$

which expanded is as follows:

$$\sum Fx_1x_2 = \sum Fd_1d_2 - C_1\sum Fd_2 - C_2\sum Fd_1 + NC_1C_2 \qquad (4)$$

But $\sum Fd_1 = NC_1$ and $\sum Fd_2 = NC_2$, and hence

$$\sum Fx_1x_2 = \sum Fd_1d_2 - NC_1C_2 \qquad (5)$$

and accordingly the formula for calculating $r$ by the use of an
arbitrary origin for $X_1$ and an arbitrary origin for $X_2$ is²

$$r_{12} = \frac{\sum Fd_1d_2 - NC_1C_2}{N\sigma_1\sigma_2} \qquad (6)$$

Further saving in calculation results, however, if this formula
is put in terms of class-interval units. In other words, the following
form is more conveniently used:

$$r_{12} = \frac{\sum F\dfrac{d_1}{i_1}\dfrac{d_2}{i_2} - N\dfrac{C_1}{i_1}\dfrac{C_2}{i_2}}{N\dfrac{\sigma_1}{i_1}\dfrac{\sigma_2}{i_2}} \qquad (7)$$

¹ Since $C_1 = \bar{X}_1 - A_1$, $\bar{X}_1 = A_1 + C_1$. By definition $x_1 = X_1 - \bar{X}_1$ and
$d_1 = X_1 - A_1$, so that $x_1 = d_1 + A_1 - A_1 - C_1 = d_1 - C_1$.
² The value of the numerator alone is $\sum Fx_1x_2$; if the problem is one in
multiple correlation, it will be convenient to have a record of this value as
well as the value of $r_{12}$.

The correlation table serves as a work sheet for the calculation
of the coefficient of correlation, as follows:

1. An arbitrary origin is chosen for each variable; thus, in
Table 34, $A_1 = 190$ and $A_2 = 190$. The arbitrary origins are
taken at the mid-points of a class interval about midway in the
range of the distribution in order to reduce to a minimum the
necessary computations.

2. A column at the side and a row at the bottom of the correlation
table are used to indicate, in class-interval units, the
deviations of each variable from the respective arbitrary origins.
This supplies entries for the rows under the caption $d_1/i_1$, and
entries in the columns opposite the stub headings $d_2/i_2$. In Table
34, $i_1 = i_2 = 20$.

3. The next column at the side and row at the bottom of Table
34 are for the purpose of entering the frequencies multiplied by
the class-interval deviations. The sums of this column and row,
respectively, are used in the calculation of the correction figures
$C_1/i_1$ and $C_2/i_2$ and in the computation of the means of the
separate frequency distributions. The sums of the columns give
the separate frequency distribution of $X_2$, and the sums of the
rows give the separate frequency distribution of $X_1$.

4. The next column and the next row are for the frequencies 
multiplied by the class-interval deviations squared, in order to 
obtain sums from which to calculate the standard deviations of 
the respective variables. 

5. The means and standard deviations of the two variables
are calculated as follows:

Calculation of the means:¹

$$\bar{X}_1 = 190 + \frac{111}{81}(20) = 190 + 27.40740 = 217.4$$

$$\bar{X}_2 = 190 + \frac{57}{81}(20) = 190 + 14.074 = 204.1$$

Calculation of the standard deviations:²

$$\left(\frac{\sigma_1}{i_1}\right)^2 = \frac{543}{81} - \left(\frac{111}{81}\right)^2 = 6.70370 - 1.87791 = 4.82579$$

$$\frac{\sigma_1}{i_1} = 2.1968 \qquad \sigma_1 = 43.94$$

$$\left(\frac{\sigma_2}{i_2}\right)^2 = \frac{493}{81} - \left(\frac{57}{81}\right)^2 = 6.08642 - 0.49519 = 5.59123$$

$$\frac{\sigma_2}{i_2} = 2.3646 \qquad \sigma_2 = 47.29$$

¹ Using Eq. (3), p. 213.
² Using Eq. (5), p. 215.

6. The product of the deviations from the chosen arbitrary
origins is obtained for each cell in the correlation table. This is
obtained by multiplying the $d_1/i_1$ by the $d_2/i_2$ corresponding to
the position of that cell. For example, for $X_1$ = 100- and $X_2$ =
60-, the cell in the first column and third row of Table 34, there
is a frequency of 2. According to the chosen arbitrary origins,
this cell has a product deviation (in terms of class-interval units)
of $-6$ multiplied by $-4$, or $+24$. Symbolically, this is
$(d_1/i_1)(d_2/i_2)$. The table is divided into four quadrants by lines
through $A_1$ and $A_2$.

Two of these quadrants will have positive product deviations,
and two will have negative product deviations. A product
deviation is entered in each cell that contains frequencies and
appears in the lower right corner of the cell. None are entered
in the first quadrant because no frequencies occur in that quadrant.
Frequencies occur in only one cell in the third quadrant,
that is, in the $X_1$ = 200-, $X_2$ = 160- cell, for which the deviation
product is $-1$ multiplied by $+1$, or $-1$.

7. The product deviation in each cell is multiplied by the 
frequency occurring in that cell, in order to obtain the proper 
number of product deviations of that particular cell. The 
product deviation occurs once in some cases and several times in 
others. Obviously, when it occurs several times the sum of the 
product deviations is obtained by multiplying by the frequencies. 
These figures are entered in each cell in the upper right corner. 
Symbolically, they are $F(d_1/i_1)(d_2/i_2)$, for each cell.

8. The sum of the figures calculated in item 7 is obtained, that
is, the sum of the product deviations multiplied by their respective
frequencies. This is accomplished by adding the figures
occurring in the upper right corner of each cell by rows and by
columns and adding the sums of the rows or the sums of the
columns to obtain the final sum. If both the latter are computed,
there will be a cross check on addition. Symbolically, the
final aggregate is $\sum F\left(\dfrac{d_1}{i_1}\right)\left(\dfrac{d_2}{i_2}\right)$.

9. The coefficient of correlation is calculated by the use of
Eq. (7) shown above, as follows:

Calculation of r:

$$r_{12} = \frac{455 - 81\left(\dfrac{111}{81}\right)\left(\dfrac{57}{81}\right)}{81(2.19677)(2.36458)} = \frac{455 - 78.11111}{420.74964} = \frac{376.88889}{420.74964} = +0.89576$$

Lines of Regression and First-order Variances. All the values
that are needed to find the lines of regression of Table 34 have
now been calculated. There are two lines of regression for each
correlation table: the first one represents the regression of $X_1$
on $X_2$ and the second the regression of $X_2$ on $X_1$. Since $r$ has
been computed, the easiest formulas for calculating these two
lines (in original units measured from the intersection of the
means of the two variables as an origin and not in class-interval
units) are as follows:

$$x_1' = r_{12}\frac{\sigma_1}{\sigma_2}x_2 \qquad\qquad x_2' = r_{12}\frac{\sigma_2}{\sigma_1}x_1$$

These equations can be expressed in the units of the original
data, that is, the scale as originally formed rather than in deviations
from the means, as follows:

$$X_1' - \bar{X}_1 = r_{12}\frac{\sigma_1}{\sigma_2}(X_2 - \bar{X}_2) \qquad\qquad X_2' - \bar{X}_2 = r_{12}\frac{\sigma_2}{\sigma_1}(X_1 - \bar{X}_1)$$


Calculation. For the problem illustrated, the lines of regression
are calculated as

$$x_1' = 0.89576\,\frac{43.9354}{47.2915}\,x_2 = 0.8322x_2 \qquad\qquad x_2' = 0.89576\,\frac{47.2915}{43.9354}\,x_1 = 0.9642x_1$$

By substituting $X_1' - \bar{X}_1$ for $x_1'$ and $X_2' - \bar{X}_2$ for $x_2'$, these
equations may be written as follows:

$$X_1' - 217.4 = 0.8322(X_2 - 204.1)$$
$$X_1' = 0.832X_2 + 47.58$$
$$X_2' - 204.1 = 0.9642(X_1 - 217.4)$$
$$X_2' = 0.964X_1 - 5.55$$

In this form the equations are more easily interpreted as
prediction equations. The first equation says that when a
student has a grade of 100 in first-semester English the predicted
grade in second-semester English is 83.2 + 47.6 = 130.8. The
second equation says that when a student has a grade of 100
in second-semester English the predicted grade in first-semester
English is 96.4 - 5.55 = 90.8.
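
The whole short-method calculation, from the work-sheet totals to the two prediction equations, can be sketched as follows. The totals 543 and 455 and $N = 81$ appear in the calculations above; the totals 111, 57, and 493 are assumptions back-computed from the printed quotients (27.40740/20, 14.074/20, and 6.08642), so they should be checked against Table 34 before being relied upon. All names are illustrative.

```python
import math

N = 81                    # number of students
A1 = A2 = 190             # arbitrary origins
I1 = I2 = 20              # class intervals

# Work-sheet totals in class-interval units.
SUM_FD1, SUM_FD2 = 111, 57        # assumed, back-computed from the text
SUM_FD1_SQ, SUM_FD2_SQ = 543, 493 # 493 is likewise back-computed
SUM_FD1D2 = 455

c1 = SUM_FD1 / N          # correction factor C1 / i1
c2 = SUM_FD2 / N          # correction factor C2 / i2
mean_1 = A1 + c1 * I1
mean_2 = A2 + c2 * I2

sd1_units = math.sqrt(SUM_FD1_SQ / N - c1 ** 2)  # sigma_1 / i1
sd2_units = math.sqrt(SUM_FD2_SQ / N - c2 ** 2)  # sigma_2 / i2

# Eq. (7): r in class-interval units.
r = (SUM_FD1D2 - N * c1 * c2) / (N * sd1_units * sd2_units)

# Regression lines in original units.
sigma_1, sigma_2 = sd1_units * I1, sd2_units * I2
b12 = r * sigma_1 / sigma_2        # slope of X1 on X2
a12 = mean_1 - b12 * mean_2
b21 = r * sigma_2 / sigma_1        # slope of X2 on X1
a21 = mean_2 - b21 * mean_1

def predict_second_semester(first_semester_grade):
    """Predicted X1 from X2 using the first line of regression."""
    return a12 + b12 * first_semester_grade

def predict_first_semester(second_semester_grade):
    """Predicted X2 from X1 using the second line of regression."""
    return a21 + b21 * second_semester_grade

print(round(r, 5), round(b12, 4), round(a12, 2), round(b21, 4), round(a21, 2))
print(round(predict_second_semester(100), 1), round(predict_first_semester(100), 1))
```

Because the class intervals are equal, the ratio $\sigma_1/\sigma_2$ is the same in class-interval units as in original units, so the slopes could equally be formed before multiplying by the interval.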

The two lines of regression are shown in Figs. 105 and 106.¹
In Fig. 105 line $aa'$ represents the first line of regression,

$$X_1' = 0.832X_2 + 47.58$$

The small crosses show the location of the means of the columns
(calculated and shown in Table 34). It is to be noted that the
line of regression follows the progression of the means of the
columns.

In Fig. 106, line $bb'$ represents the second line of regression,
$X_2' = 0.964X_1 - 5.55$. The small circles show the location of
the means of the rows (calculated and shown in Table 34).
It is to be noted that the line of regression follows the progression
of the means of the rows.

The scatter about each of the lines of regression, the first-order
$\sigma$, is calculated by using the following formulas:

$$\sigma_{1.2} = \sigma_1\sqrt{1 - r_{12}^2} \qquad\qquad \sigma_{2.1} = \sigma_2\sqrt{1 - r_{12}^2}$$

In the problem illustrated, the first-order standard deviations are
calculated as follows:

$$\sigma_{1.2} = 43.94(0.44453) = 19.53 \qquad\qquad \sigma_{2.1} = 47.29(0.44453) = 21.02$$

(When $r = 0.89576$, $\sqrt{1 - r^2} = 0.44453$.)

In Figs. 105 and 106, which show the lines of regression, there
are also shown the limits indicated by the first-order standard
deviations. Between these limits, that is, the line of regression
$\pm\sigma_{1.2}$ for Fig. 105 and the line of regression $\pm\sigma_{2.1}$ for Fig. 106,
lie roughly two thirds of the frequencies, if it can be assumed
that the population from which the sample is derived is normally
distributed. This gives some idea of how accurate estimates
based upon the lines of regression are likely to be. It is to be
noted that all the means of the columns lie within the limits
described by the first-order standard deviations and that all but
two of the means of the rows lie within these limits. Each of
the latter two means of rows, lying outside these limits (the first
row and the next to the last row), is based upon only one student's
record.

¹ See pp. 328, 329.

* Calculation of $\sqrt{1 - r^2}$ and $1 - r^2$ is avoided by the use of J. R. Miner,
Tables of $\sqrt{1 - r^2}$ and $1 - r^2$ for Use in Partial Correlation and in Trigonometry,
or an ordinary table of sines and cosines, since $\sin x = \sqrt{1 - \cos^2 x}$.

Progressions of Means. For these data the progressions of
means have already been discussed in Chap. XIII. Figures
105 and 106 show the means of the columns and the means of the
rows plotted from the values computed in Table 34 and reproduced
in Table 26.¹ Figure 105 represents the means of the
vertical frequency distributions of Tables 25 and 34; it gives the
progression of the means of $X_1$ with changing values of $X_2$.
Figure 106 gives a similar analysis for the means of the rows.


1 See pp. 330 and 358. 


CHAPTER XV 
NONLINEAR CORRELATION 


All the foregoing discussion has been concerned with those
cases in which the progression of the means is linear. In such
cases it was found that $r = \sum x_1x_2 / N\sigma_1\sigma_2$ was an
appropriate measure of correlation. If the progression of means
and the distribution of cases around the means is as pictured in
Fig. 113, however, r may show little correlation, especially in such
cases as A and C, although there may be a high degree of association
between the variables. It is the purpose of this chapter to
indicate ways of describing and measuring such nonlinear
correlation.

As indicated in an earlier chapter, the best way of studying
any correlation is to make a bivariate scatter diagram of the
data. If the data are numerous enough to be grouped into class
intervals, then the means of the rows and columns may be
computed and the variation in the means of each variable with
changes in the other variable may be studied.

In the linear case in which a line of regression was used to 
measure the association it was found that, the smaller the 
scatter, the higher the degree of correlation, the equation being 


$$r_{12}^2 = 1 - \frac{\sigma_{1.2}^2}{\sigma_1^2}$$


The same sort of formula may be used to measure the degree
of relationship indicated by the progression in the means. To
distinguish them from the correlation coefficient these measures
are called “correlation ratios” and are defined by the formulas

$$\eta_{12}^2 = 1 - \frac{\sigma_{X_1-\bar{X}_c}^2}{\sigma_1^2} \qquad \eta_{21}^2 = 1 - \frac{\sigma_{X_2-\bar{X}_r}^2}{\sigma_2^2} \tag{1}$$

where $\bar{X}_c$ represents the means of $X_1$ for various values of $X_2$,
$\bar{X}_r$ represents the means of $X_2$ for various values of $X_1$, and
$\sigma_{X_1-\bar{X}_c}^2$ and $\sigma_{X_2-\bar{X}_r}^2$ represent the sum of
the squared deviations around the means pooled for all the column
or row means and divided by N. Thus,

$$\sigma_{X_1-\bar{X}_c}^2 = \frac{\sum\sum (X_1 - \bar{X}_c)^2}{N} \qquad \sigma_{X_2-\bar{X}_r}^2 = \frac{\sum\sum (X_2 - \bar{X}_r)^2}{N}$$

Fig. 113. Illustrations of various kinds of nonlinear correlation.

The correlation ratios $\eta_{12}$ and $\eta_{21}$ give some indication
of the degree to which the means of one variable are successful in
measuring the variation in the other variable. They may be
used to measure either linear or nonlinear correlation.
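The definition of the correlation ratio may be sketched in a few
lines of code. The sketch assumes ungrouped observations of $X_1$,
each labeled with the class of $X_2$ (the "column") into which it
falls; the function name and the six observations are invented for
illustration and are not taken from the tables of this book.

```python
from collections import defaultdict

def correlation_ratio_sq(x2_classes, x1_values):
    """eta_12^2 = 1 - (pooled variance about column means) / (total variance)."""
    n = len(x1_values)
    grand_mean = sum(x1_values) / n
    total_ss = sum((v - grand_mean) ** 2 for v in x1_values)

    columns = defaultdict(list)            # one "column" per class of X2
    for cls, v in zip(x2_classes, x1_values):
        columns[cls].append(v)

    within_ss = 0.0                        # squared deviations about column means
    for vals in columns.values():
        col_mean = sum(vals) / len(vals)
        within_ss += sum((v - col_mean) ** 2 for v in vals)

    return 1.0 - within_ss / total_ss

# Hypothetical data: X1 rises and then falls as X2 increases
x2 = [1, 1, 2, 2, 3, 3]
x1 = [1.0, 3.0, 7.0, 9.0, 2.0, 2.0]
print(correlation_ratio_sq(x2, x1))
```

For these invented figures the sum of the products of the deviations
of $X_1$ and $X_2$ from their means is zero, so that r is zero, while
the correlation ratio is high; this is the situation described for
the nonlinear cases of Fig. 113.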

If the means of one variable seem to mark off a definite curve 
or if in the case of ungrouped data a bivariate chart indicates a 
fairly definite form of nonlinear variation, then the average 
variation in one variable with changes in the other variable may 
be indicated by drawing a smooth curve or fitting one by some 
mathematical process, such as the method of least squares. 
Such a curve might be called a “curve of regression.” A line
of regression on the one hand indicates the average change 
in one variable with a unit change of the other variable; this 
average change is the same for all values of the independent 
variable, since the slope of a straight line is constant. A curve 
of regression on the other hand gives the average change in one 
variable with a unit change in the other variable; but this average 
change varies from one value of the independent variable to 
another, since the slope of a curve changes at each point. The 
technique of fitting a curve of regression will be discussed in a 
subsequent section. 

To measure the degree with which a curve of regression
measures the association between two variables, an index of
correlation is defined in a manner similar to the definitions of
r and $\eta$. It depends on the closeness with which the various
cases are scattered about the curve and is defined by the formula

$$I_{12}^2 = 1 - \frac{\sigma_{X_1-C_{12}}^2}{\sigma_1^2} \qquad I_{21}^2 = 1 - \frac{\sigma_{X_2-C_{21}}^2}{\sigma_2^2} \tag{2}$$

where $C_{12}$ and $C_{21}$ refer to the regression curves,
$\sigma_{X_1-C_{12}}^2$ refers to the variance of the deviations from the
curve of regression of $X_1$ on $X_2$, and $\sigma_{X_2-C_{21}}^2$ refers to
the variance of the deviations from the curve of regression of
$X_2$ on $X_1$.
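A minimal sketch of the index of correlation follows. It assumes
that the heights of the regression curve at the observed values of
$X_2$ are already known, and it takes the "variance of the deviations
from the curve" to be the mean of the squared deviations; the
observations and the curve $X_1 = 10/X_2$ are hypothetical.

```python
def index_of_correlation_sq(x1_values, curve_values):
    """I_12^2 = 1 - (variance of deviations from the curve) / (variance of X1)."""
    n = len(x1_values)
    mean_1 = sum(x1_values) / n
    var_1 = sum((v - mean_1) ** 2 for v in x1_values) / n
    var_about_curve = sum((v - c) ** 2
                          for v, c in zip(x1_values, curve_values)) / n
    return 1.0 - var_about_curve / var_1

# Hypothetical observations and the heights of an assumed regression curve
x2 = [1.0, 2.0, 3.0, 4.0]
x1 = [10.5, 4.5, 3.5, 2.5]
curve = [10.0 / v for v in x2]          # assumed curve X1 = 10 / X2
print(index_of_correlation_sq(x1, curve))
```

The closer the cases lie to the curve, the smaller the variance
about the curve and the nearer the index comes to unity.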

Although $r_{12} = r_{21}$, the two correlation ratios and the two
indexes of correlation are not necessarily equal. That is,
$\eta_{12} \neq \eta_{21}$ and $I_{12} \neq I_{21}$. In addition,
$\eta \geq I \geq r$.

Since the variance about the means or about a curve is never
greater than the total variance, these formulas always give a
positive value and their square roots are indeterminate as to
sign. The square roots of $\eta^2$ and $I^2$ give an index of correlation,
and the question as to whether it is a positive or negative
relationship must be answered by reference to a correlation table
or a figure showing the polygon or curve of regression. In
the case of curvilinear correlation, the question of positive or
negative relationship often is irrelevant because two variables
may be positively correlated up to a certain point and then
negatively correlated beyond that point. Consequently, it
becomes necessary to describe the entire relationship. For
example, the death rate due to puerperal septicemia is correlated
with ages of the female population in a nonlinear manner. The
relationship between the two is best described by a curve or
polygon of regression, which would have to be seen in its entirety
if the relationship is to be completely understood. This is
illustrated in Fig. 55 (page 151). If r merely were calculated, it
might conceivably be zero when there is in fact a close
relationship. An index of such a relationship is found by the
calculation of $\eta_{12}$ or $I_{12}$.

Calculation of the Correlation Ratio. The calculation of the 
correlation ratio will be illustrated by reference to the Mount 
Holyoke data in Table 34 (page 358). Although the relationship 
appears to be linear, it is worth while to compute the correlation 
ratio to see how close it comes to r. If the difference is not very 
great, the linearity will be numerically demonstrated. 

Equation (1) for the correlation ratios may be put in the form

$$\eta_{12}^2 = \frac{\sigma_1^2 - \sigma_{ac}^2}{\sigma_1^2} \qquad \eta_{21}^2 = \frac{\sigma_2^2 - \sigma_{ar}^2}{\sigma_2^2} \tag{3}$$

where $\sigma_{ac}$ and $\sigma_{ar}$ are abbreviated expressions for
$\sigma_{X_1-\bar{X}_c}$ and $\sigma_{X_2-\bar{X}_r}$ and thus represent the
average standard deviations around the means of the columns and
the means of the rows, respectively, as explained above. In order
to apply these equations for finding the correlation ratios it is
necessary to find the values of $\sigma_{ac}$ and $\sigma_{ar}$. This can
be most conveniently done with the help of a work sheet that makes
use of arbitrary origins ($A_1$ and $A_2$) and class-interval
deviations $d_1/i_1$ and $d_2/i_2$. Such a work sheet is Table 35, in
which the computations are carried out for the data of Table 34.
The algebraic foundation for these computations is as follows:

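It is assumed that the same $A_1$ is used for every column as
for the total frequency distribution of $X_1$. Then for each
column the sum of the squares of the deviation from the column
mean would be¹

$$\sum^{N_c}(X_1 - \bar{X}_c)^2 = i_1^2\left[\sum^{N_c}F\left(\frac{d_1}{i_1}\right)^2 - \frac{\left(\sum^{N_c}F\frac{d_1}{i_1}\right)^2}{N_c}\right]$$

Therefore,

$$\frac{N\sigma_{ac}^2}{i_1^2} = \sum^{N}F\left(\frac{d_1}{i_1}\right)^2 - \sum^{m}\frac{\left(\sum^{N_c}F\frac{d_1}{i_1}\right)^2}{N_c} \tag{4}$$

It has been determined already that²

$$\frac{N\sigma_1^2}{i_1^2} = \sum^{N}F\left(\frac{d_1}{i_1}\right)^2 - \frac{\left(\sum^{N}F\frac{d_1}{i_1}\right)^2}{N} \tag{5}$$

1 This follows from Eqs. (1) and (2) of Chap. VII. For it will be noted
that $\sigma^2 = i^2(\nu_2 - \nu_1^2)$, $\nu_2 = \sum F(d/i)^2/N$, and
$\nu_1 = \sum F(d/i)/N$.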

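2 See Eqs. (1) and (2) of Chap. VII and previous footnote.

Each of the variances in Eq. (3) may be expressed in class-
interval units so that its numerator is the arithmetic difference
between Eqs. (5) and (4) and its denominator is Eq. (5). Thus,

$$\eta_{12}^2 = \frac{\displaystyle\sum^{m}\frac{\left(\sum^{N_c}F\frac{d_1}{i_1}\right)^2}{N_c} - \frac{\left(\sum^{N}F\frac{d_1}{i_1}\right)^2}{N}}{\displaystyle\sum^{N}F\left(\frac{d_1}{i_1}\right)^2 - \frac{\left(\sum^{N}F\frac{d_1}{i_1}\right)^2}{N}} \tag{6}$$

Similarly it can be shown that for a table with “l” rows

$$\eta_{21}^2 = \frac{\displaystyle\sum^{l}\frac{\left(\sum^{N_r}F\frac{d_2}{i_2}\right)^2}{N_r} - \frac{\left(\sum^{N}F\frac{d_2}{i_2}\right)^2}{N}}{\displaystyle\sum^{N}F\left(\frac{d_2}{i_2}\right)^2 - \frac{\left(\sum^{N}F\frac{d_2}{i_2}\right)^2}{N}} \tag{7}$$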

All the items in these two formulas (6) and (7) are to be found
on the work sheet in Table 34 with the exception of

$$\sum^{m}\frac{\left(\sum^{N_c}F\frac{d_1}{i_1}\right)^2}{N_c} \qquad\text{and}\qquad \sum^{l}\frac{\left(\sum^{N_r}F\frac{d_2}{i_2}\right)^2}{N_r}$$

These two figures are obtained from the correlation-ratio work
sheet (Table 35).

In Table 35, the frequency is placed in large type in the center
of a cell. Each column is now regarded as a separate frequency
distribution whose total number of cases $N_c$ is shown in the
row headed $N_c$. For each column the same arbitrary origin
($A_1 = 190$) as that used in Table 34 is used; hence the same
$d_1/i_1$ can be used for each column.

For all 11 columns, in the upper right corner of each interval
that contains a frequency is a number in small type representing
the $F\,d_1/i_1$ for that interval of the column. These are then
summed, giving for each column a $\sum^{N_c}F\,d_1/i_1$; each of these
sums is shown in the row with the stub title $\sum^{N_c}F\,d_1/i_1$.
If each sum is divided by the number of cases in the column, $N_c$,
and multiplied by $i_1$, the resulting number is the correction
factor $C_c$ for that column. Accordingly, the mean for that column
($\bar{X}_c$) can be found by using the formula $\bar{X}_c = A_1 + C_c$.
The results of this calculation are shown in the row with the stub
title $C_c$, and the column means are shown in the row with the stub
heading $\bar{X}_c$. In order to obtain the figure to be used in the
formula for the correlation ratio (that is, for the square root of
$\eta_{12}^2$), another row of figures is now added to Table 35; this
set of figures consists of the $\left(\sum^{N_c}F\,d_1/i_1\right)^2/N_c$
for each column; and when these are summed for all columns (say
for “m” columns), the resulting figure is as follows:

$$\sum^{m}\frac{\left(\sum^{N_c}F\frac{d_1}{i_1}\right)^2}{N_c} = 473.1215$$


Using Eq. (6), the correlation ratio of $X_1$ on $X_2$ is thus¹

$$\eta_{12}^2 = \frac{473.1215 - \frac{(111)^2}{81}}{543 - \frac{(111)^2}{81}} = \frac{473.1215 - 152.1111}{543.0000 - 152.1111} = \frac{321.0104}{390.8889} = 0.82123$$

$$\eta_{12} = 0.9062$$

1 The values of $\sum F\,d_1/i_1 = 111$ and $\sum F\,(d_1/i_1)^2 = 543$ are
found in Table 34. In that table the same $A_1$ was used for the
frequency distribution of $X_1$.

To calculate the means of the rows and the correlation ratio
of $X_2$ on $X_1$, every row of Table 35 is treated as a separate
frequency distribution. The same $A_2$ is used for each row as
the $A_2$ in Table 34 for the entire $X_2$ distribution. Hence
the same set of $d_2/i_2$ may be used for each row. The $F\,d_2/i_2$
for each interval of each row is placed in small type in the lower
right corner of the interval. These summed for each row give
the $\sum^{N_r}F\,d_2/i_2$ shown in the column with that title heading.
From these are obtained the $C_r$ for each row, by the same procedure
as that used for finding the column means. For each row, the
$\left(\sum^{N_r}F\,d_2/i_2\right)^2/N_r$ is then computed and entered in
the column with the title $(1)^2/N_r$. The sum of these for all row
frequency distributions (say l rows) constitutes the aggregate

$$\sum^{l}\frac{\left(\sum^{N_r}F\frac{d_2}{i_2}\right)^2}{N_r} = 436.2445$$

This is the value required by Eq. (7) for finding $\eta_{21}$. Thus,¹

$$\eta_{21}^2 = \frac{436.2445 - \frac{(57)^2}{81}}{493 - \frac{(57)^2}{81}} = \frac{436.2445 - 40.1111}{493.0000 - 40.1111} = \frac{396.1334}{452.8889} = 0.87467$$

$$\eta_{21} = 0.9352$$

1 The values for $\sum F\,d_2/i_2 = 57$ and $\sum F\,(d_2/i_2)^2 = 493$ are
found in Table 34. In that table the $A_2$ is the same as the $A_2$
used in the present table.

The Correlation Ratio and Analysis of Variance. The square
of the correlation ratio is a measure of the proportion of variance
due to correlation, in the same manner as it was indicated that
the square of the coefficient of correlation is a measure of
proportion of variance due to correlation.

As has been explained, when expressed in the form
$r_{12}^2\sigma_1^2 = \sigma_1^2 - \sigma_{1.2}^2$, the square of the
coefficient of correlation reveals itself as the proportion of the
total variance that is due to correlation or association with $X_2$
as measured by the line of regression of $X_1$ on $X_2$. In a similar
manner, $\eta_{12}^2\sigma_1^2 = \sigma_{mc}^2$ and likewise
$\eta_{21}^2\sigma_2^2 = \sigma_{mr}^2$. The square of the correlation
ratio thus describes the proportion of the total variance that is
due to correlation as measured by the fluctuations in the means of
the columns and rows. The standard deviation of the means of the
columns squared, $\sigma_{mc}^2$, is the variance that is due to
correlation of $X_1$ with $X_2$, and similarly for the correlation of
$X_2$ with $X_1$.

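To demonstrate algebraically that
$\eta_{12}^2\sigma_1^2 = \sigma_{mc}^2 = \sigma_1^2 - \sigma_{ac}^2$, where
$\sigma_{ac}^2$ is the pooled variance about the column means and
$\sigma_{mc}^2$ is the variance of the column means, it is necessary
first to note that the weighted mean of the means of the columns
equals $\bar{X}_1$. By definition,

$$\bar{X}_c = \frac{\sum^{N_c}X_1}{N_c}$$

and thus

$$N_c\bar{X}_c = \sum^{N_c}X_1$$

which, if summed for all columns, becomes

$$\sum^{m}N_c\bar{X}_c = \sum^{m}\sum^{N_c}X_1 \tag{8}$$

But

$$\sum^{m}\sum^{N_c}X_1 = \sum^{N}X_1$$

and hence, if Eq. (8) is divided by N, it is equivalent to

$$\frac{\sum^{m}N_c\bar{X}_c}{N} = \frac{\sum^{N}X_1}{N} = \bar{X}_1 \tag{9}$$

which was to be proved.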
If $\bar{X}_1$, the mean of the entire $X_1$ distribution, is now selected
as the arbitrary origin,

$$\bar{X}_c = \bar{X}_1 + C_c \qquad\text{or}\qquad C_c = \bar{X}_c - \bar{X}_1$$

Also, when the mean of the entire distribution is selected as
the arbitrary origin for each column, the standard deviation
of the column is found by

$$\sigma_c^2 = \frac{\sum^{N_c}x_1^2}{N_c} - C_c^2$$

where $x_1 = X_1 - \bar{X}_1$. On substituting $C_c = \bar{X}_c - \bar{X}_1$
and transposing, an expression for each column similar to the
following will result:

$$\frac{\sum^{N_c}x_1^2}{N_c} = \sigma_c^2 + (\bar{X}_c - \bar{X}_1)^2 \tag{a}$$

Multiplying the equation for each column by its $N_c$, respectively,
will result for each column in

$$\sum^{N_c}x_1^2 = N_c\sigma_c^2 + N_c(\bar{X}_c - \bar{X}_1)^2 \tag{b}$$

When the whole series, one for each column, of equations such as
(b) are totaled, the following result is obtained:

$$\sum^{m}\sum^{N_c}x_1^2 = \sum^{m}N_c\sigma_c^2 + \sum^{m}N_c(\bar{X}_c - \bar{X}_1)^2 \tag{c}$$

But, in this equation,

$$\sum^{m}\sum^{N_c}x_1^2 = \sum^{N}x_1^2 = N\sigma_1^2$$

and

$$\sum^{m}N_c\sigma_c^2 = N\sigma_{ac}^2$$

Moreover, by definition, the explained variance, that is to say,
the variance of the means of the columns about the weighted
mean of these means, is as follows:

$$N\sigma_{mc}^2 = \sum^{m}N_cC_c^2 = \sum^{m}N_c(\bar{X}_c - \bar{X}_1)^2$$

Consequently, (c) may be written

$$N\sigma_1^2 = N\sigma_{ac}^2 + N\sigma_{mc}^2$$

or

$$\sigma_1^2 = \sigma_{ac}^2 + \sigma_{mc}^2$$
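The division of the total variance into a part about the column
means and a part belonging to the column means themselves may be
illustrated numerically. The sketch below uses three invented
columns of $X_1$ values (one list per class of $X_2$); the function
name is hypothetical.

```python
def variance_parts(columns):
    """Return (total, within-column, between-column) variances, each over N."""
    all_vals = [v for col in columns for v in col]
    n = len(all_vals)
    mean_1 = sum(all_vals) / n
    total = sum((v - mean_1) ** 2 for v in all_vals) / n
    within = 0.0      # pooled variance about the column means
    between = 0.0     # variance of the column means, weighted by N_c
    for col in columns:
        col_mean = sum(col) / len(col)
        within += sum((v - col_mean) ** 2 for v in col) / n
        between += len(col) * (col_mean - mean_1) ** 2 / n
    return total, within, between

# Hypothetical columns of X1 values, one list per class of X2
total, within, between = variance_parts([[1.0, 3.0], [7.0, 9.0, 8.0], [2.0]])
print(total, within + between)    # the two figures agree
print(between / total)            # the square of the correlation ratio
```

Whatever figures are supplied, the first two printed values agree
(apart from rounding in the machine arithmetic), since the identity
is algebraic and does not depend on the form of the relationship.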


Substituting the value of $\sigma_{mc}^2 = \sigma_1^2 - \sigma_{ac}^2$ in
Eq. (3) for the correlation ratio gives the following:

$$\eta_{12}^2 = \frac{\sigma_{mc}^2}{\sigma_1^2} \tag{10}$$

Similarly, it can be shown that

$$\eta_{21}^2 = \frac{\sigma_{mr}^2}{\sigma_2^2} \tag{11}$$

Calculation of Curvilinear Regression. To illustrate the
statistical problem involved in curvilinear regression and the
calculation of the correlation index I, data on cotton stocks,
production, and imports compared with cotton prices, 1920-1939,
have been selected. They are shown in Table 36 and plotted in
Fig. 114.

TABLE 36. STOCKS, PRODUCTION, AND IMPORTS OF COTTON AND PRICE
OF COTTON RECEIVED BY PRODUCERS IN THE UNITED STATES

Stocks at beginning of crop year plus year’s production plus net imports.
Prices are deflated by United States index of wholesale prices for crop years.

Year          Deflated price,      Stocks, production,
              cents per pound      and imports,
                                   10 million bales
              X1                   X2
1920-1921     13.47                1.726
1921-1922     18.06                1.480
1922-1923     22.63                1.306
1923-1924     29.30                1.274
1924-1925     22.63                1.550
1925-1926     19.19                1.805
1926-1927     12.92                2.195
1927-1928     20.95                1.711
1928-1929     18.71                1.749
1929-1930     18.34                1.755
1930-1931     12.13                1.862
1931-1932      8.38                2.378
1932-1933     10.30                2.307
1933-1934     14.04                2.162
1934-1935     15.76                1.765
1935-1936     13.83                1.815
1936-1937     14.48                1.821
1937-1938     10.29                2.382
1938-1939     11.17                2.383

The position of plotted bivariates in Fig. 114 suggests that a 
curve such as aa’ might fit the data. The question of the type of 
curve fitted is of particular importance in curvilinear regression 
and accordingly three types will be discussed for illustrative 
purposes.

Fig. 114. Bivariate scatter diagram and fitted curve showing relationship
between the price of cotton and the supply of cotton. (Horizontal
scale: U.S. supply of cotton in units of 10 million bales.)


Logarithmic Regression. The constant slope of a straight
line depicts the fact that the change in $X_1$ is constant for a
given quantity of change in $X_2$, and vice versa. The changing
slope of a curve depicts the fact that change in $X_1$ varies for
different values of $X_2$, and vice versa. One such curvilinear
relationship between $X_1$ and $X_2$ is as follows:

$$X_1X_2^b = k \tag{12}$$

In Eq. (12) the varying manner in which $X_1$ fluctuates with
respect to $X_2$ depends on the exponent b. If b is larger than 1,
a small change in $X_2$ must produce a large change in $X_1$ because,
as the equation indicates, their product (when $X_2$ is raised to
the b power) is constant. If b is equal to 1, the changes in $X_1$
must be just proportionate (in an inverse manner) to the changes
in $X_2$. If b is less than 1, the changes in $X_1$ must be
proportionately less than the changes in $X_2$. If such an equation
is used to describe the line of regression of price of cotton on
stocks and production of cotton, a very flexible price of cotton will
result in a value of b larger than 1; a very inflexible price of
cotton will result in a value of b less than 1. The nature of
Eq. (12) assumes that the flexibility in the price of cotton remains
the same regardless of stocks and production, because it sets up
the hypothesis that the product equals a constant.

Fig. 115. The relationship of Fig. 114 in logarithmic form.

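If such an equation is assumed to be suitable for the problem
in hand, the fitting of the curve of regression may be simplified
by first transforming the equation to its logarithmic form, namely,

$$\log X_1 + b\log X_2 = \log k$$

which is the equation of a straight line,

$$\log X_1 = a + b\log X_2 \tag{13}$$

if $\log k = a$ and if the slope b of Eq. (13) is understood to be
the negative of the exponent b of Eq. (12).
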


Figure 115 shows the effect of transforming the bivariate
frequency distribution from original units to logarithmic units.
The data plotted are the same as the data plotted in Fig. 114,
except that, in Fig. 115, the $X_1$ and $X_2$ scales refer to the
logarithms of $X_1$ and $X_2$. When the bivariate logarithms shown in
the first two columns of Table 37 are plotted in Fig. 115, a straight
line fits the points. Thus the logarithmic transformation has
converted a curvilinear correlation problem into a simple linear
correlation problem in which the Pearsonian coefficient of
correlation is $r_{\log X_1 \log X_2}$, and the line of regression of
$\log X_1$ on $\log X_2$ is as follows:

$$\log X_1 - \text{mean of }\log X_1 = r_{\log X_1 \log X_2}\,\frac{\sigma_{\log X_1}}{\sigma_{\log X_2}}\,(\log X_2 - \text{mean of }\log X_2)$$

TABLE 37. LOGARITHMS OF UNITED STATES PRODUCTION, STOCKS, AND
IMPORTS OF COTTON AND OF THE PRICE OF COTTON RECEIVED BY PRODUCERS
With columns for the squares of the logarithms and their cross products

X1 = price of cotton
X2 = stocks, production, and imports of cotton

log X1     log X2     log X1 log X2     log² X1     log² X2
1.1294     0.2370     0.2677            1.2755      0.0562
1.2567     0.1703     0.2140            1.5793      0.0290
1.3547     0.1159     0.1570            1.8352      0.0134
1.4669     0.1052     0.1543            2.1518      0.0111
1.3547     0.1903     0.2578            1.8352      0.0362
1.2831     0.2565     0.3291            1.6464      0.0658
1.1113     0.3414     0.3794            1.2350      0.1166
1.3212     0.2333     0.3082            1.7456      0.0544
1.2721     0.2428     0.3089            1.6182      0.0590
1.2634     0.2443     0.3086            1.5962      0.0597
1.0839     0.2700     0.2927            1.1748      0.0729
0.9232     0.3762     0.3473            0.8523      0.1415
1.0128     0.3631     0.3677            1.0258      0.1318
1.1474     0.3349     0.3843            1.3165      0.1122
1.1976     0.2467     0.2954            1.4343      0.0609
1.1408     0.2589     0.2954            1.3014      0.0670
1.1608     0.2603     0.3022            1.3475      0.0678
1.0124     0.3769     0.3816            1.0250      0.1421
1.0481     0.3771     0.3952            1.0985      0.1422
Σ = 22.5405   5.0011   5.7468           27.0945     1.4398


The equations of regression could be obtained in the above form
and then transformed into their antilogarithmic form; but in
this problem it is more convenient to find the regression equation
directly from the least-squares equations. Accordingly, the
regression statistics a and b of Eq. (13) may be calculated by
using the following least-squares equations:¹

$$\sum\log X_1 = Na + b\sum\log X_2$$
$$\sum\log X_1\log X_2 = a\sum\log X_2 + b\sum\log^2 X_2$$

Table 37 is a work sheet providing columns to calculate
$\sum\log X_1$, $\sum\log X_2$, $\sum\log X_1\log X_2$, $\sum\log^2 X_1$, and
$\sum\log^2 X_2$, using the data of the cotton problem for which the
raw data are found in Table 36. The first two columns of Table 37
show the logarithms of the price of cotton in the United States and
of the stocks, production, and imports of cotton. The third column
contains the cross products of the logarithms. The fourth and fifth
columns contain the squares of the logarithms in the first two
columns. The sums of the columns provide the values that are
required to find the regression statistics a and b for Eq. (13).

Calculation of the regression of $\log X_1$ on $\log X_2$:

22.5405 = 19a + 5.0011b
 5.7468 = 5.0011a + 1.4398b

In order to solve, eliminate a by multiplying the second equation
by 3.7992 and subtracting it from the first, as follows:

22.5405 = 19a + 5.0011b
21.8332 = 19a + 5.4701b
 0.7073 = -0.4690b
      b = -1.5081

Substituting this value of b in either of the equations will show
that

a = 1.5833

Accordingly, the equation of logarithmic regression of $\log X_1$
on $\log X_2$ is as follows:

$$\log X_1 = 1.5833 - 1.5081\log X_2$$

which may be transformed into antilogarithmic form as follows:

$$X_1X_2^{1.5081} = 38.31$$
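The fitting of the logarithmic line may be sketched directly from
the raw figures of Table 36. The sketch takes the logarithms by
machine rather than from a four-place table, so its results can be
expected to agree with the work-sheet values only to about two or
three decimal places; the function name `fit_line` is illustrative.

```python
import math

# Table 36: X1 = deflated price, X2 = supply in units of 10 million bales
price = [13.47, 18.06, 22.63, 29.30, 22.63, 19.19, 12.92, 20.95, 18.71, 18.34,
         12.13, 8.38, 10.30, 14.04, 15.76, 13.83, 14.48, 10.29, 11.17]
supply = [1.726, 1.480, 1.306, 1.274, 1.550, 1.805, 2.195, 1.711, 1.749, 1.755,
          1.862, 2.378, 2.307, 2.162, 1.765, 1.815, 1.821, 2.382, 2.383]

def fit_line(xs, ys):
    """Solve the two least-squares normal equations for y = a + b*x."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    b = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    a = (sy - b * sx) / n
    return a, b

log_x1 = [math.log10(v) for v in price]
log_x2 = [math.log10(v) for v in supply]
a, b = fit_line(log_x2, log_x1)

print(a, b)           # intercept and slope of the logarithmic line
print(10 ** a, -b)    # k and the exponent in X1 * X2**b = k
```

The slope of the fitted line is negative, and its absolute value is
the exponent applied to $X_2$ in the antilogarithmic form.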
1 See p. 333.

Reciprocal Regression. Reciprocal regression is a special
form of the type of regression indicated by Eq. (12); for if b = 1,
changes in $X_1$ are related reciprocally to changes in $X_2$. In
other words, the equation becomes

$$X_1X_2 = k \qquad\text{or}\qquad X_2 = k\,\frac{1}{X_1} \tag{14}$$

which, placed in a more general form, is as follows:

$$\frac{1}{X_1} = a + bX_2 \tag{15}$$

If the reciprocal of each $X_1$ is found, it is possible to find the
equation for the reciprocal regression by fitting a straight line
to $X_2$ and the reciprocal of $X_1$, that is to say, by fitting an
equation such as (15). Figure 116 shows the effect of transforming
one of the variables of the bivariate frequency distribution from
original units to reciprocal units. In the figure the vertical
scale is $1/X_1$ while the horizontal scale remains $X_2$. When the
bivariates shown in Table 38 are plotted in Fig. 116, a straight
line fits the points. Thus the reciprocal transformation has
converted a problem in curvilinear correlation into a problem
in simple linear correlation in which the Pearsonian coefficient
of correlation is $r_{\frac{1}{X_1}X_2}$ and the line of regression is as
follows:

$$\frac{1}{X_1} - \text{mean of }\frac{1}{X_1} = r_{\frac{1}{X_1}X_2}\,\frac{\sigma_{1/X_1}}{\sigma_{X_2}}\,(X_2 - \bar{X}_2)$$

Fig. 116. The relationship of Fig. 114 in reciprocal form.


TABLE 38. UNITED STATES SUPPLY OF COTTON AND THE RECIPROCAL OF
THE PRICE OF COTTON RECEIVED BY PRODUCERS
With columns for the squares and the cross products
X1 = price of cotton
X2 = supply of cotton

X2          1/X1        X2/X1       X2²          (1/X1)²
1.726       0.07424     0.12814     2.97908      0.00551
1.480       0.05537     0.08195     2.19040      0.00307
1.306       0.04419     0.05771     1.70564      0.00195
1.274       0.03413     0.04348     1.62308      0.00116
1.550       0.04419     0.06849     2.40250      0.00195
1.805       0.05211     0.09406     3.25803      0.00272
2.195       0.07740     0.16989     4.81803      0.00599
1.711       0.04773     0.08167     2.92752      0.00228
1.749       0.05345     0.09348     3.05900      0.00286
1.755       0.05453     0.09570     3.08003      0.00297
1.862       0.08244     0.15350     3.46704      0.00680
2.378       0.11933     0.28377     5.65488      0.01424
2.307       0.09709     0.22399     5.32225      0.00943
2.162       0.07123     0.15400     4.67424      0.00507
1.765       0.06345     0.11199     3.11523      0.00403
1.815       0.07231     0.13124     3.29423      0.00523
1.821       0.06906     0.12576     3.31604      0.00477
2.382       0.09718     0.23148     5.67392      0.00944
2.383       0.08953     0.21335     5.67869      0.00802
Σ = 35.426  1.29896     2.54365     68.23983     0.09749


The equations of regression could be obtained in the above form
and then transformed into their original units, but in this
problem it is more convenient to find the regression equation
directly from the least-squares equations. The normal least-
squares equations are as follows:

$$\sum\frac{1}{X_1} = Na + b\sum X_2$$
$$\sum\frac{X_2}{X_1} = a\sum X_2 + b\sum X_2^2$$

Table 38 is a work sheet with columns in which the required sums
are obtained. Entering these sums in the above least-squares
equations makes it possible to evaluate the regression statistics
a and b for Eq. (15).

Calculation of the regression of $1/X_1$ on $X_2$:

1.29896 = 19a + 35.4260b
2.54365 = 35.4260a + 68.23983b

Multiplying the first equation by 1.8645263 and subtracting the
result from the second equation eliminates a and gives a solution
for b as follows:

b = 0.05564

Substituting this value in either equation gives the solution of a
as follows:

a = -0.03538

The equation of regression is therefore as follows:

$$\frac{1}{X_1} = -0.03538 + 0.05564X_2$$

This equation describes the straight line plotted in Fig. 116.
Plotted on scales of $X_1$ and $X_2$, the equation is a curve.
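The elimination just described may be sketched as follows. The
five figures supplied are the column totals of Table 38; the
function and argument names are illustrative, and the multiplier k
corresponds to the 1.8645263 of the text.

```python
def solve_two_normal_equations(n, sx, sxx, sy, sxy):
    """Solve   sy  = n*a  + sx*b
               sxy = sx*a + sxx*b    by eliminating a."""
    k = sx / n                        # multiplier applied to the first equation
    b = (sxy - k * sy) / (sxx - k * sx)
    a = (sy - sx * b) / n
    return a, b

# Column totals of Table 38 (y = 1/X1, x = X2)
a, b = solve_two_normal_equations(n=19, sx=35.4260, sxx=68.23983,
                                  sy=1.29896, sxy=2.54365)
print(a, b)

# Turning the fitted line back into a price estimate for X2 = 2.5
print(1.0 / (a + b * 2.5))
```

Because the line is fitted to the reciprocal of the price, an
estimate of the price itself is obtained by taking the reciprocal
of the value given by the line.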

Parabolic Regression. The curvilinear relationships so far 
considered have been relationships that could readily be trans- 
formed to a linear form, by taking logarithms or reciprocals. 
Such transformations reduced the problem to one of simple 
linear correlation between the transformed variables, and there 
was little in the analysis that was different from that of the 
previous chapters. A curve that cannot easily be transformed 
to a linear form is the parabolic relationship 


$$X_1 = a + b_1X_2 + b_2X_2^2 \tag{16}$$


This must be fitted directly. Fortunately, the nature of
the curve is such that the method of least squares can be used.
According to this, to fit a parabolic regression the least-squares
equations are obtained as follows:

The least-squares criterion is that

$$\sum d^2 = \sum (X_1 - X_1')^2 = \text{minimum}$$

or

$$\sum (X_1 - a - b_1X_2 - b_2X_2^2)^2 = \text{minimum}$$

where $X_1'$ denotes the value of $X_1$ estimated from Eq. (16).
For this to be a minimum its total differential should be equal
to zero; that is, differentiating with respect to a, $b_1$, and $b_2$
and setting equal to zero, the following normal equations are
obtained:

$$\sum X_1 = Na + b_1\sum X_2 + b_2\sum X_2^2$$
$$\sum X_1X_2 = a\sum X_2 + b_1\sum X_2^2 + b_2\sum X_2^3$$
$$\sum X_1X_2^2 = a\sum X_2^2 + b_1\sum X_2^3 + b_2\sum X_2^4$$

Table 39 is a work sheet providing for the calculation and
checking of the sums entering into the three parabolic equations
of regression. Using the sums of the appropriate columns
the following set of equations is obtained for the calculation of
the regression statistics a, $b_1$, and $b_2$ for the regression of
$X_1$ on $X_2$, shown in Eq. (16):

306.58   = 19a      + 35.4260b1  + 68.2398b2     (I)
542.7359 = 35.4260a + 68.2398b1  + 135.4744b2    (II)
994.4092 = 68.2398a + 135.4744b1 + 276.3974b2    (III)


The solution of three equations for three unknowns should be
undertaken in an orderly manner; this is attempted in Table 40,
which is a work sheet following the so-called Doolittle method.
This work sheet provides a step-by-step check on the calculations
as the solution of the equations proceeds. In order to avoid
copying a, $b_1$, and $b_2$ each time an equation is written down,
a, $b_1$, and $b_2$ are written as the titles of columns in which
their coefficients are entered. In the table only the coefficients
are entered in their respective columns with the proper sign before
each figure. For example, row (1) of the table is presumed to read
as follows:

$$19a + 35.4260b_1 + 68.2398b_2 - 306.58 = 0$$

which is the first equation above with slight rearrangement of
terms.

Three steps are involved in solving three equations for three
unknowns: (1) to get an equation in the three unknowns in which
the coefficient of one of the unknowns is unity, (2) to get an
equation in only two of the unknowns in which the coefficient
of one of the two is unity, and (3) to get an equation in only one
of the unknowns in which its coefficient is unity. When the
third step is accomplished, the value of the third unknown is
obtained. This value, applied in the equation obtained by the
second step, makes it possible to evaluate the second unknown;
and the first unknown is then obtained by applying these two
values in the equation obtained by the first step. This is the
same process as that used for finding two unknowns from two
equations.

Table 40 provides an orderly procedure and also a check
for these steps. The first step is accomplished in row (2) of
the table, by multiplying Eq. (I), copied in row (1), by the
negative reciprocal of the coefficient of a, that is, by $-1/19$; this
will make the coefficient of a become $-1$. The second step,
rows (3) to (6), eliminates a from two of the equations in order
to obtain in row (5) an equation in $b_1$ and $b_2$. In order to
eliminate a, the first equation must be divided by its own
coefficient of a and multiplied by the coefficient of a of Eq. (II);
in other words, Eq. (I) must be multiplied by $-35.4260/19$. The
multiplier is given a negative sign so that, when added to Eq. (II),
the a term will cancel. Row (6) multiplies row (5) by the negative
reciprocal of the coefficient of $b_1$ in row (5). The third step,
rows (7) to (11), accomplishes the elimination of two of the
variables, ending with an equation in only one of them, which of
course gives its value. In order to do this, Eq. (III) is copied in
row (7); Eq. (I) is multiplied by a number that will give it a
coefficient of a equal, with opposite sign, to the coefficient of a
of Eq. (III), that is, by $-68.2398/19$, and this is entered in
row (8); then the equation obtained in row (6) (in terms of only
$b_1$ and $b_2$, a having been eliminated) is multiplied by a number
that, combined with the two coefficients of $b_1$ in rows (7) and
(8), will give a sum of zero. The sum of rows (7) to (9) will then
eliminate both a and $b_1$, giving in row (10) such an equation.
When row (10) is multiplied by the negative reciprocal of its
coefficient of $b_2$, the value of $b_2$ is obtained; this is shown
in row (11).

A column for sums is provided in order to obtain a step-by-
step check on all calculations. This is done by applying to
the sums the same multipliers as those applied to the equations.
For example, the sum of row (1) multiplied by $-1/19$ should equal
the sum of row (2). In the column headed Checks are entered
the products obtained by multiplying the sums as indicated
under Remarks to visualize the checks.

From Table 40, the values of a, $b_1$, and $b_2$ are obtained from
equations in rows (2), (6), and (11), as follows:

Row (2),   -a - 1.864526b1 - 3.59157b2 + 16.1358 = 0
Row (6),   -b1 - 3.76733b2 - 13.2095 = 0
Row (11),  -b2 + 7.9944 = 0

b2 = 7.9944
b1 = -3.76733(7.9944) - 13.2095
   = -43.327
a  = 43.327(1.864526) - 7.9944(3.59157) + 16.135789
   = 68.2077

The equation of regression of $X_1$ on $X_2$ is therefore as follows:

$$X_1 = 68.2077 - 43.327X_2 + 7.9944X_2^2$$
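The three-step solution may be sketched as an ordinary elimination
followed by back substitution. The sketch is not the Doolittle work
sheet itself (it carries no check column); it forms the sums from
the raw figures of Table 36, so its results may differ slightly in
the later decimal places from those of Table 40, the equations being
sensitive to rounding. All function names are illustrative.

```python
# Table 36: X1 = deflated price, X2 = supply in units of 10 million bales
price = [13.47, 18.06, 22.63, 29.30, 22.63, 19.19, 12.92, 20.95, 18.71, 18.34,
         12.13, 8.38, 10.30, 14.04, 15.76, 13.83, 14.48, 10.29, 11.17]
supply = [1.726, 1.480, 1.306, 1.274, 1.550, 1.805, 2.195, 1.711, 1.749, 1.755,
          1.862, 2.378, 2.307, 2.162, 1.765, 1.815, 1.821, 2.382, 2.383]

def solve_linear_system(matrix, rhs):
    """Forward elimination, then back substitution (no pivoting)."""
    n = len(rhs)
    rows = [list(m) + [r] for m, r in zip(matrix, rhs)]    # augmented rows
    for i in range(n):
        pivot = rows[i][i]
        rows[i] = [v / pivot for v in rows[i]]             # leading coefficient -> 1
        for j in range(i + 1, n):
            factor = rows[j][i]
            rows[j] = [vj - factor * vi for vj, vi in zip(rows[j], rows[i])]
    solution = [0.0] * n
    for i in reversed(range(n)):
        solution[i] = rows[i][n] - sum(rows[i][k] * solution[k]
                                       for k in range(i + 1, n))
    return solution

def power_sum(p):
    """Sum of X2**p (p = 0 gives N)."""
    return sum(x ** p for x in supply)

def cross_sum(p):
    """Sum of X1 * X2**p."""
    return sum(y * x ** p for x, y in zip(supply, price))

normal_matrix = [[power_sum(0), power_sum(1), power_sum(2)],
                 [power_sum(1), power_sum(2), power_sum(3)],
                 [power_sum(2), power_sum(3), power_sum(4)]]
normal_rhs = [cross_sum(0), cross_sum(1), cross_sum(2)]

a, b1, b2 = solve_linear_system(normal_matrix, normal_rhs)
print(a, b1, b2)
```

As in the work sheet, the last unknown is found first and the
others follow by substitution in the earlier equations.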


Estimates Based on Regression Equations. Using the three
equations of regression calculated above for the regression of
$X_1$ on $X_2$, that is, for the regression of the price of cotton on
production, stocks, and imports of cotton in the United States,
estimates may be made of the price that will result from a given
volume of stocks plus production plus imports. The three
equations are as follows:

Logarithmic regression,  $\log X_1 = 1.5833 - 1.5081\log X_2$
Reciprocal regression,   $\frac{1}{X_1} = -0.03538 + 0.05564X_2$
Parabolic regression,    $X_1 = 68.2077 - 43.327X_2 + 7.9944X_2^2$

To illustrate the method of estimation, suppose the questions
are asked: What is the expected price of cotton if the cotton
stocks plus the year’s production and imports amount to 25 million
bales? What is the expected price of cotton if the cotton
stocks plus the year’s production and imports amount to 22
million bales? 19 million bales? 16 million bales? 13 million
bales? Only 10 million bales? How much higher will the
price be in a year of shortage than in a year of large carry-over
and large production and imports of cotton? Table 41 shows
how these estimates are made by using the above three equations
of regression. When the year’s cotton stocks, production, and
imports are large, the three regression equations give results
that are approximately equal to each other; but when the year’s
cotton stocks, production, and imports are small, the estimates
based upon the three regression equations differ sharply from
each other.

TABLE 41. ESTIMATES OF COTTON PRICES BASED ON THREE REGRESSION CURVES

Estimates based on logarithmic regression

log X2     Equation of estimate:                      Estimate
           1.5833 - 1.5081 log X2 = log X1'           of X1
0.39794    1.5833 - 1.5081(0.39794) = 0.98317          9.62
0.34242    1.5833 - 1.5081(0.34242) = 1.06690         11.67
0.27875    1.5833 - 1.5081(0.27875) = 1.16292         14.55
0.20412    1.5833 - 1.5081(0.20412) = 1.27547         18.86
0.11394    1.5833 - 1.5081(0.11394) = 1.46612         29.25
0.00000    1.5833 - 1.5081(0.00000) = 1.58330         38.31

Estimates based on reciprocal regression

X2         Equation of estimate:                      Estimate
           -0.03538 + 0.05564 X2 = 1/X1'              of X1
2.5        -0.03538 + 0.05564(2.5) = 0.10372           9.64
2.2        -0.03538 + 0.05564(2.2) = 0.08703          11.49
1.9        -0.03538 + 0.05564(1.9) = 0.07034          14.22
1.6        -0.03538 + 0.05564(1.6) = 0.05364          18.64
1.3        -0.03538 + 0.05564(1.3) = 0.03695          27.06
1.0        -0.03538 + 0.05564(1.0) = 0.02026          49.36

Estimates based on parabolic regression

X2         Equation of estimate:
           68.2077 - 43.327 X2 + 7.9944 X2^2 = X1'
2.5        68.2077 - 43.327(2.5) + 7.9944(6.25) =  9.85
2.2        68.2077 - 43.327(2.2) + 7.9944(4.84) = 11.58
1.9        68.2077 - 43.327(1.9) + 7.9944(3.61) = 14.75
1.6        68.2077 - 43.327(1.6) + 7.9944(2.56) = 19.35
1.3        68.2077 - 43.327(1.3) + 7.9944(1.69) = 25.39
1.0        68.2077 - 43.327(1.0) + 7.9944(1.00) = 32.87
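The method of estimation can also be sketched in code. The three functions below simply evaluate the regression equations quoted above; the function names are illustrative, and the supply figure is assumed to be expressed in tens of millions of bales (so that 25 million bales is entered as 2.5), as the tabulated values suggest. The figures are recomputed from the equations rather than copied from Table 41, so they need not agree with the printed entries in every row.

```python
import math

def log_estimate(x2):
    """Price from the logarithmic regression: log X1' = 1.5833 - 1.5081 log X2."""
    return 10 ** (1.5833 - 1.5081 * math.log10(x2))

def reciprocal_estimate(x2):
    """Price from the reciprocal regression: 1/X1' = -0.03538 + 0.05564 X2."""
    return 1.0 / (-0.03538 + 0.05564 * x2)

def parabolic_estimate(x2):
    """Price from the parabolic regression: X1' = 68.2077 - 43.327 X2 + 7.9944 X2^2."""
    return 68.2077 - 43.327 * x2 + 7.9944 * x2 ** 2

for supply in (2.5, 2.2, 1.9, 1.6, 1.3, 1.0):
    print(supply,
          round(log_estimate(supply), 2),
          round(reciprocal_estimate(supply), 2),
          round(parabolic_estimate(supply), 2))
```

Printing the three estimates side by side makes the point of the paragraph visible: the curves nearly coincide at large supplies and separate widely at small ones.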
First-order Standard Deviation Used as Standard Error of
Estimate. The dispersion about a curve of regression can be
measured in the same manner as the dispersion of cases about a
progression of means or a line of regression. The measure
generally used is the standard deviation and is called a “first-order
standard deviation” or a “standard error of estimate,”
because it is the standard deviation of the residuals about the
curves of regression by means of which estimates such as those
illustrated in Table 41 are made.

For the illustration in which cotton stocks, production, and
imports are correlated with cotton prices, three types of
regression lines have been fitted, as follows:

$$\log X_1' = a + b\log X_2 \tag{A}$$

$$\frac{1}{X_1'} = a + bX_2 \tag{B}$$

$$X_1' = a + b_1X_2 + b_2X_2^2 \tag{C}$$


The standard error of estimate, being a standard deviation, is
defined as follows:

$$N\sigma_{1.2}^2 = \sum d^2 \tag{17}$$

where each $d$ is defined, taking regression type (C), for example,
as

$$d = X_1 - X_1' = X_1 - a - b_1X_2 - b_2X_2^2 \tag{18}$$

Hence, each $d^2$ will be as follows:

$$d^2 = d(X_1 - a - b_1X_2 - b_2X_2^2) = dX_1 - ad - b_1X_2d - b_2X_2^2d \tag{19}$$

If all these $d^2$'s are added, the following result is obtained:

$$\sum d^2 = \sum X_1d - a\sum d - b_1\sum X_2d - b_2\sum X_2^2d \tag{20}$$

By the least-squares condition, however, the last three terms
of Eq. (20) are equal to zero, for¹

$$\sum d = \sum(X_1 - a - b_1X_2 - b_2X_2^2) = 0$$
$$\sum X_2d = \sum X_2(X_1 - a - b_1X_2 - b_2X_2^2) = 0$$
$$\sum X_2^2d = \sum X_2^2(X_1 - a - b_1X_2 - b_2X_2^2) = 0$$

Therefore, Eq. (20) reduces to the following:

$$\sum d^2 = \sum X_1d = \sum X_1(X_1 - a - b_1X_2 - b_2X_2^2) = \sum X_1^2 - a\sum X_1 - b_1\sum X_1X_2 - b_2\sum X_1X_2^2 \tag{21}$$

¹ See p. 384.


Accordingly, the formula for the square of the standard error of
estimate is as follows:

$$\sigma_{1.2}^2 = \frac{\sum X_1^2 - a\sum X_1 - b_1\sum X_1X_2 - b_2\sum X_1X_2^2}{N} \tag{22}$$

If regression type (B) were taken, it can be shown similarly
that

$$\sigma_{1.2}^2 = \frac{\sum\dfrac{1}{X_1^2} - a\sum\dfrac{1}{X_1} - b\sum\dfrac{X_2}{X_1}}{N} \tag{23}$$

If the logarithmic regression equation is chosen, the standard
error of estimate is found by a similar procedure to be as follows:

$$\sigma_{1.2}^2 = \frac{\sum\log^2 X_1 - a\sum\log X_1 - b\sum\log X_1\log X_2}{N} \tag{24}$$
The values necessary to calculate these standard errors of 
estimate are available, respectively, in Tables 40, 39, and 38. 
Calculation of standard error of estimate: For the logarithmic
regression:¹

$$\sigma_{1.2}^2 = \frac{27.0945 - (1.5833)(22.5405) - (-1.5081)(5.7468)}{19}$$
$$= \frac{27.0945 - 35.6884 + 8.6668}{19} = \frac{0.0729}{19} = 0.0038$$
$$\sigma_{1.2} = 0.06164$$


By using the ordinary formula for the standard deviation,

$$\sigma_1^2 = \frac{\sum x^2}{N} = \frac{\sum X^2}{N} - \left(\frac{\sum X}{N}\right)^2$$

(when the arbitrary origin is taken as zero), the necessary figures
are found in totals of the appropriate columns of Table 37, and
it is found that

$$\sigma_1^2 = 0.0187 \qquad \sigma_1 = 0.1368$$

¹ The scatter formula for the logarithmic regression could be calculated
by using the formula employed in the linear case, as follows:

$$\sigma_{1.2}^2 = \sigma_1^2(1 - r^2)$$

Since, however, the logarithmic $r$ has not been calculated, it is simpler to
use the formula based on the least-squares equations.
For the reciprocal regression:¹

$$\sigma_{1.2}^2 = \frac{0.09749 - (-0.03538)(1.29896) - (0.05564)(2.54365)}{19}$$
$$= \frac{0.09749 + 0.04596 - 0.14153}{19} = \frac{0.00192}{19} = 0.000101$$
$$\sigma_{1.2} = 0.01$$

The standard deviation of $1/X_1$ is found by using the following
formula:

$$\sigma_1^2 = \frac{\sum\dfrac{1}{X_1^2}}{N} - \left(\frac{\sum\dfrac{1}{X_1}}{N}\right)^2$$

The necessary values are found in the sums of the appropriate
columns of Table 38.

$$\sigma_1^2 = 0.00869 \qquad \sigma_1 = 0.0932$$
For the parabolic regression:

$$\sigma_{1.2}^2 = \frac{5{,}451.3758 - 68.2077(306.58) - (-43.327)(542.7359) - (7.9944)(994.4092)}{19}$$
$$= \frac{5{,}451.3758 - 20{,}911.1167 + 23{,}515.1183 - 7{,}949.7049}{19} = \frac{105.6725}{19} = 5.5617$$
$$\sigma_{1.2} = 2.3583$$

Using the ordinary formula for calculating the standard deviation
when zero is taken as the arbitrary origin,

$$\sigma_1^2 = 26.5508 \qquad \sigma_1 = 5.1528$$
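Formulas (22), (23), and (24) share one pattern: the sum of squares of the dependent quantity, less each regression statistic times its corresponding sum, all divided by $N$. The sketch below applies that pattern to the sums quoted in the calculations above; the function name and argument layout are illustrative only. Because it carries full precision instead of rounding each intermediate figure, the results agree with the printed ones only up to rounding.

```python
import math

def scatter_from_sums(sum_y_squared, coefficients, cross_sums, n):
    """Return (variance, standard error) about a fitted curve.

    variance = (sum y^2 - sum_k coefficient_k * cross_sum_k) / n
    """
    explained = sum(c * s for c, s in zip(coefficients, cross_sums))
    variance = (sum_y_squared - explained) / n
    return variance, math.sqrt(variance)

N = 19
log_var, log_se = scatter_from_sums(
    27.0945, (1.5833, -1.5081), (22.5405, 5.7468), N)
rec_var, rec_se = scatter_from_sums(
    0.09749, (-0.03538, 0.05564), (1.29896, 2.54365), N)
par_var, par_se = scatter_from_sums(
    5451.3758, (68.2077, -43.327, 7.9944), (306.58, 542.7359, 994.4092), N)

print(round(log_var, 4), round(rec_var, 6), round(par_var, 4))
```

The three standard errors are in different units (logarithms, reciprocals, and original price units), so they cannot be compared with one another directly.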


¹ The scatter formula for the reciprocal regression could be calculated by
using the formula for the linear case, as follows:

$$\sigma_{1.2}^2 = \sigma_1^2(1 - r^2)$$

Since, however, the reciprocal $r$ has not been calculated, it is simpler to use
the formula based on the least-squares equations.

Table 42 is a summary of the estimates of cotton prices made
above, together with ranges of plus and minus one standard
error of estimate. In the cases of the logarithmic and reciprocal
regressions these ranges are converted into original units of
the data in order to show their significance.

TABLE 42. ESTIMATES AND RANGES OF TWICE THE STANDARD ERROR OF
ESTIMATE FOR COTTON PRICES BASED ON THREE REGRESSION CURVES

Estimates and ranges, logarithmic regression

Estimated log   log X1'     log X1'     Estimated   Range of estimated
of price,       + σ1.2      - σ1.2      price,      price, antilogarithms
log X1'                                 X1'
0.98317         1.04481     0.92153      9.62       11.09      8.35
1.06690         1.12854     1.00526     11.67       13.45     10.12
1.16292         1.22456     1.10128     14.55       16.77     12.63
1.27547         1.33711     1.21383     18.86       21.73     16.36
1.46612         1.52776     1.40448     29.25       33.71     25.38
1.58330         1.64494     1.52166     38.31       44.15     33.24

Estimates and ranges, reciprocal regression

Estimated       1/X1'       1/X1'       Estimated   Range of estimated
reciprocal      + σ1.2      - σ1.2      price,      price, converted from
of price,                               X1'         reciprocals
1/X1'
0.10372         0.11372     0.09372      9.64        8.79     10.67
0.08703         0.09703     0.07703     11.49       10.31     12.98
0.07034         0.08034     0.06034     14.22       12.45     16.57
0.05364         0.06364     0.04364     18.64       15.71     22.91
0.03695         0.04695     0.02695     27.06       21.30     37.11
0.02026         0.03026     0.01026     49.36       33.05     97.96

Estimates and ranges, parabolic regression

Estimated       Standard error of estimate
price,          X1' + σ1.2  X1' - σ1.2
X1'
 9.85           12.21        7.49
11.58           13.94        9.22
14.75           17.11       12.39
19.35           21.71       16.99
25.39           27.75       23.03
32.87           35.23       30.51

The differences
are notable. For the lower levels of price, the reciprocal
regression gives estimates with small standard errors of estimate,
but for the higher price levels the standard error of estimate is
smallest with the parabolic regression. Each of these methods of
calculating regression curves assumes that the variance in $X_1$
is the same for all subgroups of $X_1$ associated with varying
values of $X_2$. The logarithmic regression assumes that, when
converted into logarithms, the variance about the logarithmic
regression is equal at all points but that, when converted into
antilogarithms, it will be larger for the higher prices. The
reciprocal regression assumes equal variance about the curve in
terms of reciprocals but, when converted, the variance about the
higher prices is larger than the variance about the lower prices.
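The conversion of a range of one standard error back into original units can be sketched as follows. The function names are illustrative; the standard errors 0.06164 and 0.01 and the first-row estimates are those quoted in the text.

```python
def log_range(log_estimate, standard_error):
    """Price range from an estimate and standard error in logarithmic units."""
    return (10 ** (log_estimate + standard_error),
            10 ** (log_estimate - standard_error))

def reciprocal_range(reciprocal_estimate, standard_error):
    """Price range from an estimate and standard error in reciprocal units.

    Adding the standard error to the reciprocal lowers the price, so the
    first value returned is the lower end of the price range.
    """
    return (1.0 / (reciprocal_estimate + standard_error),
            1.0 / (reciprocal_estimate - standard_error))

log_high, log_low = log_range(0.98317, 0.06164)
rec_low, rec_high = reciprocal_range(0.10372, 0.01)
print(round(log_high, 2), round(log_low, 2), round(rec_low, 2), round(rec_high, 2))
```

Although the range is symmetrical in the transformed units, it is not symmetrical about the estimated price once converted, and it widens as the price rises. That is the unequal variance in original units described above.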

The question suggests itself: Which one of these three assumptions
about the character of variance about the curves of regression
best suits the data of the particular problem? This question
is answered by determining which of the regression curves is the
best fit for the data in question.

Correlation Index. For each of the curves of regression
calculated in the previous section, a corresponding index of
correlation will help to determine which of the regression curves
is the best fit for the data. The standard error of estimate
measures the divergence of the bivariates from the curve of
regression; the correlation index measures the goodness of fit
of the curve of regression. The indexes of correlation may be
calculated by using Eq. (2).

Calculation of Indexes of Correlation: For the logarithmic
regression:

$$I_{12}^2 = \frac{\sigma_1^2 - \sigma_{1.2}^2}{\sigma_1^2} = \frac{0.0187 - 0.0038}{0.0187} = \frac{0.0149}{0.0187} = 0.7968$$
$$I_{12} = 0.8926$$


For the reciprocal regression:

$$I_{12}^2 = \frac{0.00869 - 0.00010}{0.00869} = \frac{0.00859}{0.00869} = 0.9885$$
$$I_{12} = 0.9942$$

For the parabolic regression:

$$I_{12}^2 = \frac{26.5508 - 5.5617}{26.5508} = \frac{20.9891}{26.5508} = 0.7905$$
$$I_{12} = 0.8891$$
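The same calculation can be sketched as a small function. It takes the total variance and the squared standard error of estimate exactly as printed in the calculations above; the function name is illustrative.

```python
import math

def correlation_index(total_variance, residual_variance):
    """Return (I squared, I), where I^2 = (sigma_1^2 - sigma_1.2^2) / sigma_1^2."""
    index_squared = (total_variance - residual_variance) / total_variance
    return index_squared, math.sqrt(index_squared)

log_i2, log_i = correlation_index(0.0187, 0.0038)
rec_i2, rec_i = correlation_index(0.00869, 0.00010)
par_i2, par_i = correlation_index(26.5508, 5.5617)
print(round(log_i, 4), round(rec_i, 4), round(par_i, 4))
```

Both arguments must be expressed in the same units (logarithms, reciprocals, or original units) for the ratio to have meaning.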


The high correlation index obtained for the reciprocal regres- 
sion appears to indicate that the cotton supply and price data 
for the period 1900 to 1940 are correlated in a reciprocal manner. 
It indicates that the sample data are fitted by the reciprocal curve 
of regression better than by either the logarithmic curve or the 
parabolic curve. 

It is to be noted that, in general, the use of the index of correlation
to show which curve is the best fit is valid only when all
curves have the same number of regression statistics. Here two
curves had two regression statistics and one had three. A curve
with a larger number of regression statistics will always give a
better fit than a similar curve with a smaller number of regression
statistics. Here, however, the parabola that had three regression
statistics gives a worse fit than either the logarithmic
or the reciprocal curve, each of which has only two regression
statistics.

The Index of Correlation and Analysis of Variance. As already
pointed out, in the cases of the logarithmic and reciprocal curves
of regression, the Pearsonian coefficient of correlation may be
calculated. When transformed into original units, this coefficient
of correlation becomes the index of correlation. In the problems
above illustrated, however, the correlation index was calculated
instead by using the general formula based upon the scatter
because the arithmetic involved in the latter method is simpler.
In logarithmic and reciprocal units, respectively, the coefficient
of correlation squared is, for these curves of regression, a coefficient
of proportional variance just as is the $r^2$ for simple linear
correlation problems.

For the parabolic curve of regression, the deviations from the
curve of regression may be described as

$$X_1 - X_1' = d \qquad\text{and}\qquad X_1' = X_1 - d$$

If these are added for the entire data, the result is

$$\sum X_1' = \sum X_1 - \sum d$$

and since $\sum d = 0$ it follows that

$$\sum X_1' = \sum X_1 = N\bar{X}_1$$

and hence the mean of $X_1'$ equals the mean of $X_1$.

Consequently, the sum of squares of $X_1'$ may be obtained as
follows:

$$N\sigma_{1'}^2 = \sum X_1'^2 - N\bar{X}_1^2 \tag{25}$$

In Eq. (25), $\sum X_1'^2$ may be evaluated as follows:

$$\sum X_1'^2 = \sum(X_1 - d)^2 = \sum X_1^2 - 2\sum X_1d + \sum d^2$$

As shown above on page 391, $\sum X_1d = \sum d^2$. Therefore,

$$\sum X_1'^2 = \sum X_1^2 - \sum d^2$$

and

$$N\sigma_{1'}^2 = \sum X_1^2 - \sum d^2 - N\bar{X}_1^2 \tag{26}$$

However, it is true by definition that

$$\sum X_1^2 - N\bar{X}_1^2 = N\sigma_1^2 \qquad\text{and}\qquad \sum d^2 = N\sigma_{1.2}^2$$

Therefore, Eq. (26) reduces to the following:

$$N\sigma_{1'}^2 = N\sigma_1^2 - N\sigma_{1.2}^2 \tag{27}$$

From Eq. (27), by dividing by $N$ and then by $\sigma_1^2$ and transposing
terms, it follows that

$$\frac{\sigma_{1'}^2}{\sigma_1^2} = 1 - \frac{\sigma_{1.2}^2}{\sigma_1^2} \tag{28}$$

and from Eq. (28) it follows by definition of $I_{12}^2$ that

$$I_{12}^2 = \frac{\sigma_{1'}^2}{\sigma_1^2} \tag{29}$$
Hence the square of the correlation index has the same significance 
as the square of the linear coefficient of correlation; it measures 


the proportion of the total variance accounted for by the assumed 
type of curvilinear correlation. 
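The partition of variance in Eqs. (25) to (29) can be checked numerically. The sketch below fits a parabola by solving the least-squares normal equations directly; the supply and price figures are hypothetical values invented for the illustration, and the small elimination routine is an ordinary Gaussian elimination, not the tabular scheme of Table 40.

```python
def solve(matrix, rhs):
    """Solve a small linear system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [r] for row, r in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x

x2 = [1.0, 1.3, 1.6, 1.9, 2.2, 2.5]          # hypothetical supply values
x1 = [33.0, 25.0, 19.5, 15.0, 11.5, 10.0]    # hypothetical prices
n = len(x1)

# Normal equations for X1' = a + b1*X2 + b2*X2^2.
powers = [sum(v ** p for v in x2) for p in range(5)]                   # sums of X2^0 .. X2^4
moments = [sum(y * v ** p for y, v in zip(x1, x2)) for p in range(3)]  # sums of X1 * X2^p
a, b1, b2 = solve([[powers[i + j] for j in range(3)] for i in range(3)], moments)

fitted = [a + b1 * v + b2 * v * v for v in x2]
d = [y - f for y, f in zip(x1, fitted)]
mean1 = sum(x1) / n

var_total = sum((y - mean1) ** 2 for y in x1) / n        # sigma_1^2
var_fitted = sum((f - mean1) ** 2 for f in fitted) / n   # sigma_1'^2
var_resid = sum(e * e for e in d) / n                    # sigma_1.2^2
index_squared = var_fitted / var_total                   # Eq. (29)
print(round(var_total, 4), round(var_fitted + var_resid, 4), round(index_squared, 4))
```

The residuals sum to zero and the two component variances add up to the total variance, as Eq. (27) requires, so the index squared computed from Eq. (29) agrees with the value computed from the scatter.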


CHAPTER XVI 
MULTIPLE AND PARTIAL CORRELATION 


To deal with the relationship between only two variables
the method of correlation so far discussed is useful, but in the
nonexperimental sciences it is frequently and indeed usually
more important to be able to deal with the association between 
three or more variables. In the social sciences in particular, 
variations in practically every factor are related to variations 
in several other rather than in a single other factor. For exam- 
ple, variations in the price of cotton are related not only to 
changes in the production and consumption of cotton but also 
to changes in the prices of substitutes for cotton such as rayon 
and, in addition, to changes in the value of money. Again, the 
consumption of a commodity such as gasoline may depend more 
upon the number of automobiles in existence and upon the 
number of miles of hard-surfaced roads available for use than 
upon the price of gasoline. As a matter of fact, it is dependent 
on all these factors and others too. In such cases it is essential 
to have some method of “multiple correlation” and “partial
correlation.”

Definitions of Terms. Multiple Correlation. Multiple corre- 
lation is an extension to more than two variables of the methods 
of simple correlation. Simple linear correlation provides a line 
of regression from which an average value for the depend- 
ent variable may be estimated if the value of the independ- 
ent variable is given. Multiple linear correlation provides a 
“plane” of regression by means of which an average value for 
the dependent variable may be estimated if the values of 
two or more independent variables are given. The plane of 
regression of the price of cotton on the price of rayon and on the 
wholesale price level, for example, would permit the estimation 
of the former from joint knowledge of the latter, instead of from 
the price of rayon alone. Similarly, the plane of regression
of the second-semester English grade on the first-semester English
grade and on the verbal scholastic-aptitude test grade would
permit the estimation of the former from joint knowledge of the
latter, instead of from only the first-semester English grade.
The regression equation, accordingly, has two or more terms
to the right instead of one; its general form is as follows:

$$X_1' = a_{1.23\ldots} + b_{12.3\ldots}X_2 + b_{13.2\ldots}X_3 + \cdots$$

where $X_1$ is the dependent variable, $X_2$, $X_3$, etc., are the independent
variables, and $a$ and the $b$’s are estimated parameters, or
regression statistics, whose numerical values are determined in
any particular case by the method of least squares. The numerical
subscripts will be explained later. For the moment it only
need be noted that a plane of regression is but the extension to
more than two variables of the idea of a line of regression.
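For three variables the plane of regression can be fitted with a short routine. The sketch below solves the two least-squares normal equations in deviation form by their closed-form solution, which is a standard result assumed here and not derived in this passage. The function name and the data are hypothetical illustrations.

```python
def fit_plane(x1, x2, x3):
    """Least-squares plane X1' = a + b12_3*X2 + b13_2*X3; returns (a, b12_3, b13_2)."""
    n = len(x1)
    m1, m2, m3 = sum(x1) / n, sum(x2) / n, sum(x3) / n
    d1 = [v - m1 for v in x1]          # deviations from the means
    d2 = [v - m2 for v in x2]
    d3 = [v - m3 for v in x3]
    s22 = sum(u * u for u in d2)
    s33 = sum(u * u for u in d3)
    s23 = sum(u * v for u, v in zip(d2, d3))
    s12 = sum(u * v for u, v in zip(d1, d2))
    s13 = sum(u * v for u, v in zip(d1, d3))
    det = s22 * s33 - s23 ** 2         # zero only if X2 and X3 are perfectly correlated
    b12_3 = (s12 * s33 - s13 * s23) / det
    b13_2 = (s13 * s22 - s12 * s23) / det
    a = m1 - b12_3 * m2 - b13_2 * m3   # the plane passes through the point of means
    return a, b12_3, b13_2

# Hypothetical figures: cotton price, rayon price, wholesale price level.
cotton = [9.5, 11.0, 12.5, 10.0, 13.5, 15.0]
rayon = [55.0, 60.0, 62.0, 58.0, 66.0, 70.0]
level = [80.0, 85.0, 95.0, 82.0, 100.0, 105.0]

a, b_rayon, b_level = fit_plane(cotton, rayon, level)
estimate = a + b_rayon * 64.0 + b_level * 90.0   # estimate from joint knowledge of both
print(round(a, 3), round(b_rayon, 3), round(b_level, 3), round(estimate, 2))
```

The estimate uses both independent variables at once, which is the sense in which the plane extends the line of regression.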

In simple linear correlation, dispersion about the line of regression
of $X_1$ on $X_2$ serves as a measure of the accuracy of any
estimate of $X_1$ made from the line of regression. In multiple
correlation, dispersion about the plane of regression serves as a
measure of the accuracy of any estimate of the dependent variable
made by reference to the plane of regression. One of the essential
problems of multiple correlation is to calculate dispersion about
the plane of regression.

In simple correlation, a line of regression is merely a law of
relationship between one variable taken as a dependent variable
and another taken as an independent variable; it does not of
itself describe the degree of relationship or association that exists.
To measure the degree of linear association is the function of the
coefficient of correlation. Since the coefficient of correlation
measures the amount of linear association, it also serves as a
measure of the goodness of fit of the linear-regression equation
to the bivariate distribution and yields a measure of the general
degree of accuracy of estimates made by reference to the regression
equation. In multiple correlation, the coefficient of multiple
correlation serves the same general function. First, it serves
as a measure of the degree of association between one variable
taken as the dependent variable and a group of other variables
taken as the independent variables. Hence, it also serves as a
measure of the goodness of fit of the calculated plane of regression
and consequently as a measure of the general degree of accuracy
of estimates made by reference to the equation for the plane of
regression.

In simple linear correlation, relationships are completely
described by two lines of regression, one in which $X_1$ is taken
as the dependent variable and the other in which $X_2$ is the
dependent variable. In multiple correlation involving three
variables, there are three planes of regression. If four variables
are involved, there are four planes of regression, and so forth.
In general, there are as many planes of regression as there are
variables that may be taken as dependent variables, in short, as
many planes of regression as variables. In particular cases, the
intuitive sense of cause and effect may lead to the rejection of
some of these possible planes of regression as being without any
practical significance. They must always, however, be considered
as theoretical possibilities.

Where only two variables are considered, the coefficient of
correlation between $X_1$, taken as dependent, and $X_2$, taken as
independent, is the same as the coefficient of correlation between
$X_2$, taken as dependent, and $X_1$, taken as independent. The
measure of goodness of fit of the line of regression of $X_1$ on $X_2$
is the same as the measure of the goodness of fit of the line of
regression of $X_2$ on $X_1$. This cannot be said of the various
multiple-correlation coefficients. The multiple-correlation coefficient
that measures the degree of association between $X_1$,
dependent, and $X_2$ and $X_3$, independent, as a group and that also
serves as a measure of the goodness of fit of the plane of regression
of $X_1$ on $X_2$ and $X_3$ is not the same as the coefficient of multiple
correlation that measures the degree of association of $X_2$, dependent,
with $X_1$ and $X_3$, independent, taken as a group and that
also measures the goodness of fit of the plane of regression of $X_2$
on $X_1$ and $X_3$. Furthermore, neither of these two coefficients is
equal, except by mere chance, to the coefficient of multiple
correlation that measures the degree of association of $X_3$, dependent,
with $X_1$ and $X_2$, independent, taken together and that also
measures the goodness of fit of the plane of regression of $X_3$ on $X_1$
and $X_2$. In multiple correlation, there are as many different coefficients
of multiple correlation as there are planes of regression.
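That the three multiple-correlation coefficients differ can be seen from a small numerical sketch. It relies on the standard three-variable expression $R_{1.23}^2 = (r_{12}^2 + r_{13}^2 - 2r_{12}r_{13}r_{23})/(1 - r_{23}^2)$, which is assumed here and not derived in this passage; the simple correlations are hypothetical.

```python
import math

def multiple_r(r_dep_a, r_dep_b, r_ab):
    """Multiple correlation of a dependent variable with two independents a and b.

    r_dep_a, r_dep_b: simple correlations of the dependent with a and with b.
    r_ab: simple correlation between the two independents.
    """
    r_squared = (r_dep_a ** 2 + r_dep_b ** 2
                 - 2 * r_dep_a * r_dep_b * r_ab) / (1 - r_ab ** 2)
    return math.sqrt(r_squared)

r12, r13, r23 = 0.8, 0.6, 0.5          # hypothetical simple correlations

R1_23 = multiple_r(r12, r13, r23)      # X1 dependent on X2 and X3
R2_13 = multiple_r(r12, r23, r13)      # X2 dependent on X1 and X3
R3_12 = multiple_r(r13, r23, r12)      # X3 dependent on X1 and X2
print(round(R1_23, 4), round(R2_13, 4), round(R3_12, 4))
```

The simple coefficient $r_{12}$ is the same whichever variable is taken as dependent, but each choice of dependent variable here yields a different multiple coefficient.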

Linear vs. Nonlinear Relationships. The simplest form of
correlation analysis rests on the assumption that the association
between the variables is of a linear type. In some cases, this
assumption does violence to the facts, the association being
clearly of a nonlinear form. Where a simple form of nonlinear
relationship exists between two variables, it has been found
possible to fit a curve of regression instead of a line of regression
and to calculate a correlation coefficient that measures the goodness
of fit of this curve. Whether such a simple curve can be
fitted or not, it is possible to calculate a measure of nonlinear
relationship, called the “correlation ratio,” that depends on a
comparison of the variation about the means of the columns
(or rows) of the grouped data with the total variation in the
data.¹

Such devices as these can also be used when nonlinear rela- 
tionships exist among three or more variables. When the 
nonlinear relationship takes a simple form, it is possible to fit 
a curved plane or a surface of regression. A multiple-correlation
index $I_{1.23}$ can also be calculated to serve as a measure of the
goodness of fit of this surface of regression. Whether a simple
form of a curved surface can be fitted or not, it is always possible 
to calculate a multiple-correlation ratio of the same sort as the 
correlation ratio for only two variables. Similar nonlinear 
relationships can also be carried over into the analysis of partial 
correlation. 

Partial Correlation. Partial correlation is concerned with a
concept resulting from the fact that more than two variables
are correlated; if only two variables are considered, there is no
place for partial correlation. Where there are three or more
variables, however, the question of the interrelationships between
the variables becomes a part of the analysis. How much of the
apparent association between two variables ($X_1$ and $X_2$) is due
to their common association with a third variable ($X_3$) and how
much to their direct connection or to some connection through
other variables independent of $X_3$? Would $X_1$ and $X_2$ continue
to vary together if $X_3$ were held constant? This is the new
problem that partial correlation attempts to solve. Fortunately,
the methods employed in its solution are the same fundamentally
as those involved in simple linear correlation.
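The question of whether two variables continue to vary together when a third is held constant can be illustrated numerically. The sketch uses the standard first-order formula $r_{12.3} = (r_{12} - r_{13}r_{23})/\sqrt{(1 - r_{13}^2)(1 - r_{23}^2)}$, which is assumed here and not derived in this passage; the correlations are hypothetical.

```python
import math

def partial_r(r_ab, r_ac, r_bc):
    """Partial correlation between a and b with c held constant."""
    return (r_ab - r_ac * r_bc) / math.sqrt((1 - r_ac ** 2) * (1 - r_bc ** 2))

# Hypothetical case: X1 and X2 are weakly related, but both are strongly
# related to X3.
r12, r13, r23 = 0.20, 0.70, 0.60

r12_3 = partial_r(r12, r13, r23)   # X1 and X2, X3 held constant
r21_3 = partial_r(r12, r23, r13)   # the same pair taken in the other order
print(round(r12_3, 4), round(r21_3, 4))
```

In this hypothetical case the apparent positive association between $X_1$ and $X_2$ is more than accounted for by their common association with $X_3$, so that the partial coefficient is negative. The order of the two primary subscripts makes no difference to the result.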

This chapter is primarily concerned with linear multiple and
partial correlation involving three variables. The notation
involved in multiple and partial correlation will first be summarized.
A brief discussion of a multivariate frequency distribution,
upon the basis of which any form of multiple or partial
analysis must be based, will follow. Ensuing sections of the
chapter will explain the fitting of planes of regression and will
derive formulas for finding the numerical values of the regression
statistics of any given plane fitted by the method of least squares.
Formulas for measuring dispersion and for calculating multiple-correlation
coefficients will also be derived. Partial correlation
will be explained in more detail, and methods of calculating
partial-correlation coefficients will be indicated. In the next
chapter the entire subject will be illustrated by an example.

¹ See Chap. XV.

Notation. It is the practice in multiple- and partial-correlation
analysis to let a symbol indicate the class to which a given
quantity belongs and to denote by subscripts the particular
number of the designated class. For example, if $X$ stands for
any variable measured in original units, $X_1$ indicates a particular
member of this group and its subscript distinguishes it from
$X_2$, $X_3$, etc., which are members of other groups. In a designated
problem, $X_1$ may be the price of cotton, $X_2$ the price of rayon,
and $X_3$ the general price level. Following is a summary of the
various symbols used in the subsequent analysis, in which special
attention should be directed to the subscripts:


$X_1$, $X_2$, $X_3$
    Variables measured in original units

$X_1'$, $X_2'$, $X_3'$
    The estimated values of these variables given by the
    three regression equations in which the variables
    are taken as dependent. The primes distinguish
    them from the actual values of $X_1$, $X_2$, and $X_3$

$x_1$, $x_2$, $x_3$
    Variables measured from their means as origins
    ($x_1 = X_1 - \bar{X}_1$, etc.)

$x_1'$, $x_2'$, $x_3'$
    The estimated values of $x_1$, $x_2$, and $x_3$ given by their
    regression equations and measured from the
    means of $X_1$, $X_2$, $X_3$ ($x_1' = X_1' - \bar{X}_1$, etc.)

$\bar{X}_1$, $\bar{X}_2$, $\bar{X}_3$
    Means of $X_1$, $X_2$, $X_3$

$\sigma_1$, $\sigma_2$, $\sigma_3$
    Standard deviations of $X_1$, $X_2$, $X_3$

$\dfrac{x_1}{\sigma_1}$, $\dfrac{x_2}{\sigma_2}$, $\dfrac{x_3}{\sigma_3}$
    Variables measured from their means as origins
    and expressed in terms of standard-deviation
    units

$\dfrac{x_1'}{\sigma_1}$, $\dfrac{x_2'}{\sigma_2}$, $\dfrac{x_3'}{\sigma_3}$
    $x_1'$, $x_2'$, $x_3'$ expressed in terms of the standard-deviation
    units of $X_1$, $X_2$, $X_3$

$$X_1' = a_{1.23} + b_{12.3}X_2 + b_{13.2}X_3$$
$$X_2' = a_{2.13} + b_{21.3}X_1 + b_{23.1}X_3 \tag{1}$$
$$X_3' = a_{3.12} + b_{31.2}X_1 + b_{32.1}X_2$$

    These are the equations for the three planes of
    regression in which the variables are measured
    in terms of original units. The $a$’s and $b$’s are
    the regression statistics of the equations, of
    which the explanation follows:

$a_{1.23}$
    The constant term in the regression equation in
    which $X_1$ is taken as the dependent variable
    and $X_2$ and $X_3$ as the independent variables

$a_{2.13}$ and $a_{3.12}$
    These are the constant terms, when $X_2$ and
    $X_3$, respectively, are the dependent variables.
    The subscript before the point refers to the
    dependent-variable number; the subscripts
    after the point refer to the independent variables.
    The order of subscripts after the point
    is immaterial; that is, $a_{1.23} = a_{1.32}$

$b_{12.3}$
    The coefficient of $X_2$ in the regression equation in
    which $X_1$ is taken as the dependent variable
    and $X_3$ is the other independent variable. The
    first number in the subscript indicates the
    dependent variable; the second number in
    the subscript indicates the variable of which the
    $b$ is a coefficient; the point followed by the
    other subscript indicates that a third variable
    is considered. Similarly, $b_{13.2}$ is the coefficient
    of $X_3$ in the same regression equation. It is to
    be noted that $b_{12.3} \neq b_{21.3}$

$b_{21.3}$
    The coefficient of $X_1$ in the regression equation in
    which $X_2$ is taken as the dependent variable
    and $X_3$ is the other independent variable;
    $b_{23.1}$ is the coefficient of $X_3$ in the same regression
    equation

$b_{31.2}$ and $b_{32.1}$
    These have a similar meaning for the third
    regression equation

$$x_1' = b_{12.3}x_2 + b_{13.2}x_3$$
$$x_2' = b_{21.3}x_1 + b_{23.1}x_3 \tag{2}$$
$$x_3' = b_{31.2}x_1 + b_{32.1}x_2$$

    Equations (2) are another form of the three regression
    equations. Here the variables are expressed
    in terms of deviations from their
    respective means. In these equations there
    are no $a$’s, or constant terms, because the planes
    of regression all pass through the point given by
    the means of the three variables. The $b$’s are
    the same as those in Eqs. (1)


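$$\frac{x_1'}{\sigma_1} = \beta_{12.3}\frac{x_2}{\sigma_2} + \beta_{13.2}\frac{x_3}{\sigma_3}$$
$$\frac{x_2'}{\sigma_2} = \beta_{21.3}\frac{x_1}{\sigma_1} + \beta_{23.1}\frac{x_3}{\sigma_3} \tag{3}$$
$$\frac{x_3'}{\sigma_3} = \beta_{31.2}\frac{x_1}{\sigma_1} + \beta_{32.1}\frac{x_2}{\sigma_2}$$

    Equations (3) give a third form in which the three
    regression equations may be written. Here
    the variables represent deviations from their
    respective means expressed in standard-deviation
    units [the $x'$’s are expressed in terms of the
    standard deviations of the $x$’s ($\sigma_1$, $\sigma_2$, $\sigma_3$)
    instead of the standard deviations of the $x'$’s
    themselves]. The form is similar to
    $\dfrac{x_1'}{\sigma_1} = r\dfrac{x_2}{\sigma_2}$
    for two variables¹

    In Eqs. (3), the $\beta$’s correspond to the $b$’s in
    Eqs. (1) and (2). As may be seen by comparing
    Eqs. (2) and (3), the $\beta$’s are related to the $b$’s
    in the following way: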


$$b_{12.3} = \beta_{12.3}\frac{\sigma_1}{\sigma_2} \qquad b_{13.2} = \beta_{13.2}\frac{\sigma_1}{\sigma_3}$$
$$b_{21.3} = \beta_{21.3}\frac{\sigma_2}{\sigma_1} \qquad b_{23.1} = \beta_{23.1}\frac{\sigma_2}{\sigma_3} \tag{4}$$
$$b_{31.2} = \beta_{31.2}\frac{\sigma_3}{\sigma_1} \qquad b_{32.1} = \beta_{32.1}\frac{\sigma_3}{\sigma_2}$$

    If the symmetry of these equations is noted,
    they are easily remembered; for example, the
    subscripts of the $b$’s are the same as the subscripts
    of the $\beta$’s, and the order of the first two
    subscript numbers describes the subscript for
    sigma in numerator and denominator, respectively.
    It is to be noted that $\beta_{12.3}$ does not
    equal $\beta_{21.3}$, etc.

¹ See pp. 349-351.


$\sigma_{1.23}$
    The scatter about the plane of regression of $X_1$ on $X_2$
    and $X_3$

$\sigma_{2.13}$
    The scatter about the plane of regression of $X_2$ on $X_1$
    and $X_3$

$\sigma_{3.12}$
    The scatter about the plane of regression of $X_3$ on $X_1$
    and $X_2$

$R_{1.23}$
    The multiple-correlation coefficient between $X_1$ on the
    one hand and $X_2$ and $X_3$ on the other

$R_{2.13}$
    The multiple-correlation coefficient between $X_2$ on the
    one hand and $X_1$ and $X_3$ on the other

$R_{3.12}$
    The multiple-correlation coefficient between $X_3$ on the
    one hand and $X_1$ and $X_2$ on the other

$r_{12.3}$
    The partial-correlation coefficient between $X_1$ and $X_2$
    when $X_3$ is held constant. The position of the subscripts
    is more important than the noncapitalization of
    the $r$ in distinguishing it from the multiple-correlation
    coefficients. The subscript after the point indicates
    which variable is held constant. $r_{12.3} = r_{21.3}$

$r_{13.2}$
    The partial-correlation coefficient between $X_1$ and $X_3$
    when $X_2$ is held constant

$r_{23.1}$
    The partial-correlation coefficient between $X_2$ and $X_3$
    when $X_1$ is held constant

Study of the symmetry in the above system of notation will 
make it easy to remember. With the exception of the notation 
for partial-correlation coefficients, the order of subscripts before 
the point is always significant; following the point it is always 
immaterial. 
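The subscript rule for Eqs. (4) lends itself to a short sketch: the first subscript number gives the sigma in the numerator and the second gives the sigma in the denominator. The function name, the standard deviations, and the $\beta$ values below are all hypothetical.

```python
def b_from_beta(subscript, beta, sigma):
    """Convert a standardized coefficient to original units by the rule of Eqs. (4).

    subscript: a string such as "12.3"; its first digit names the dependent
    variable and its second digit the variable of which b is the coefficient.
    sigma: mapping from variable number to standard deviation.
    """
    dependent, independent = int(subscript[0]), int(subscript[1])
    return beta * sigma[dependent] / sigma[independent]

sigma = {1: 4.0, 2: 2.0, 3: 5.0}                      # hypothetical standard deviations
betas = {"12.3": 0.50, "13.2": 0.30, "21.3": 0.45}    # hypothetical beta coefficients

b = {key: b_from_beta(key, value, sigma) for key, value in betas.items()}
print(b)
```

Reversing the first two subscript numbers inverts the ratio of the sigmas as well as changing the $\beta$, which is one way of seeing why $b_{12.3}$ and $b_{21.3}$ are in general unequal.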


MULTIVARIATE FREQUENCY DISTRIBUTION 


The monovariate frequency distribution, it will be recalled,
is the basis for the determination of various measures describing
the central tendency and variation about the central tendency
of a single variable. The bivariate frequency distribution
(Chap. XIII) is the basis for the calculation of the lines of
regression and the simple correlation $r$ as well as for the calculation
of correlation ratios. In fact, the bivariate frequency
distribution contained all the information regarding the joint
variation or covariance of $X_1$ and $X_2$ and hence formed the basis
for the calculation of any measure or law of relationship between
these two variables, linear or otherwise. Similarly, the multivariate
frequency distribution contains all the information about
the covariance of $X_1$, $X_2$, $X_3$, etc., and it thus forms the basis for
the calculation of any measure or law of relationship between the
different variables, individually or in groups.

FIG. 117. A trivariate frequency distribution.

Figure 117 shows a trivariate frequency distribution in which
each variable is grouped into three class intervals. A small
number of class intervals is taken in order to simplify the diagram;
in any actual problem, the number of class intervals would,
of course, be larger.

The figure shows the frequency (the number written on the
floor of each cubical cell) with which $X_1$ falls within a given class
interval at the same time that $X_2$ falls within another given class
interval and $X_3$ falls within a third given class interval. Accordingly,
in Fig. 117, the frequency with which $X_1$ takes on values
between 100 and 200 at the same time that $X_2$ takes on values
between 1 and 2 and $X_3$ takes on values between 5 and 10 is 10.
This is the frequency of the joint occurrence of the specified
$X_1$, $X_2$, and $X_3$ values. The frequencies in other cells represent
the frequency of joint occurrence of other $X_1$, $X_2$, and $X_3$
combinations.

If the frequencies of this trivariate frequency distribution are 
projected upon any one of the three reference planes, that is, if 
the frequencies are added from top to bottom, from left to right, 
or from front to rear, a bivariate frequency distribution is 
obtained for two of the three variables. For example, if the
frequencies are projected upon the $X_2X_3$ plane, the bivariate
frequency distribution for these two variables shown in Table 43
is obtained. In Table 43 the frequencies of the trivariate
frequency distribution in Fig. 117 are added from top to bottom.

TABLE 43 [bivariate frequency distribution of $X_2$ and $X_3$; table body not recoverable]

If the frequencies are projected upon the $X_1X_2$ plane, the
bivariate frequency distribution of $X_1$ and $X_2$ shown in Table 44
is found. To obtain the frequencies in Table 44, the frequencies
in the trivariate frequency distribution (Fig. 117) are added
from front to rear.

TABLE 44 [bivariate frequency distribution of $X_1$ and $X_2$; table body not recoverable]

Finally, if the frequencies are projected onto the $X_1X_3$ plane,
the bivariate frequency distribution of $X_1$ and $X_3$ shown in
Table 45 is obtained. The frequencies shown in Table 45 are
the sum from left to right of the frequencies in the trivariate
frequency distribution shown in Fig. 117.

TABLE 45 [bivariate frequency distribution of $X_1$ and $X_3$; table body not recoverable]
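The projection of a trivariate frequency distribution upon its reference planes can be sketched in a few lines of Python. The cell frequencies below are hypothetical, since the values of Fig. 117 are not reproduced here; the array `freq`, the helper `project`, and the assignment of array axes to $X_1$, $X_2$, $X_3$ are assumptions made only for illustration.

```python
import numpy as np

# Hypothetical 3 x 3 x 3 cell frequencies (NOT the values of Fig. 117).
# Array axes are taken, by assumption, in the order (X1, X2, X3).
freq = np.array([
    [[6, 3, 1], [3, 5, 2], [1, 2, 3]],
    [[3, 5, 2], [5, 10, 5], [2, 5, 3]],
    [[3, 2, 1], [2, 5, 3], [1, 3, 6]],
])


def project(freq, summed_axis):
    """Add the cell frequencies along one axis, leaving a bivariate table."""
    return freq.sum(axis=summed_axis)


table_x2_x3 = project(freq, 0)  # X1 summed out: projection on the X2X3 plane
table_x1_x2 = project(freq, 2)  # X3 summed out: projection on the X1X2 plane
table_x1_x3 = project(freq, 1)  # X2 summed out: projection on the X1X3 plane

print(table_x2_x3)
print(table_x1_x2)
print(table_x1_x3)
```

Each of the three bivariate tables necessarily contains the same total number of cases as the trivariate distribution from which it was projected.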


In the three bivariate frequency distributions shown in Tables
43 to 45, it is to be noted that $X_1$ and $X_2$ are positively
correlated, as are also $X_1$ and $X_3$ and $X_2$ and $X_3$. The given
trivariate frequency distribution (Fig. 117) is one in which all the
variables are positively correlated with each other. In this
case, as the values of $X_2$ and $X_3$ both increase, the mean value
of $X_1$ also tends to increase; in other words, the plane of
regression of $X_1$ on $X_2$ and $X_3$ would slope upward from the origin in
both the $X_2$ and the $X_3$ direction. Because of the all-round
positive correlation between the variables, the other planes of
regression would also slope upward from the origin in both directions.

The net regression between two variables in a multivariate
distribution is measured by the $b$ statistic, and it is possible to
have a negative net regression $b_{12.3}$ although the Pearsonian
coefficient of correlation $r_{12}$ is positive, and vice versa. If $r_{12}$
is small compared with $r_{13}$ and $r_{23}$, the latter being either both
negative or both positive, the plane of regression of $X_1$ on $X_2$
and $X_3$ may slope downward in the $X_2$ direction even if $r_{12}$ is
positive. The statistic $b_{12.3}$ is of the same sign as $r_{12}$ if
$r_{12} - r_{13}r_{23}$ is of the same sign as $r_{12}$. If this condition is not
fulfilled, that is, if $r_{12} - r_{13}r_{23}$ and $r_{12}$ are of opposite sign, $b_{12.3}$
will be opposite in sign to $r_{12}$ and the plane will slope in the
opposite direction from that indicated by the sign of $r_{12}$, which,
when multiplied by the ratio $\sigma_1/\sigma_2$, describes the slope of the
line of regression in the bivariate distribution of $X_1$ and $X_2$.* In
the case where $r_{12}$ is positive but $b_{12.3}$ is negative, the coefficient
of partial correlation $r_{12.3}$ is negative, agreeing with the sign of
the $b$ statistic. For this reason, the partial-correlation coefficient
may be said to measure the net correlation between the two
variables.

* See pp. 349-351.
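The reversal of sign described above may be illustrated numerically. The sketch below uses the formulas for $\beta_{12.3}$ and $r_{12.3}$ that are derived later in this chapter [Eqs. (9) and (23)]; the three correlation coefficients are hypothetical values chosen only to make $r_{12}$ small compared with $r_{13}$ and $r_{23}$, and the function names are assumptions of this illustration.

```python
import math


def beta_12_3(r12, r13, r23):
    """Net regression of X1 on X2 in standard-deviation units, Eq. (9)."""
    return (r12 - r13 * r23) / (1 - r23 ** 2)


def partial_r12_3(r12, r13, r23):
    """Partial correlation of X1 and X2 with X3 held constant, Eq. (23)."""
    return (r12 - r13 * r23) / math.sqrt((1 - r13 ** 2) * (1 - r23 ** 2))


# Hypothetical zero-order correlations: r12 is positive but small
# compared with r13 and r23, which are both positive.
r12, r13, r23 = 0.3, 0.7, 0.7

beta = beta_12_3(r12, r13, r23)
partial = partial_r12_3(r12, r13, r23)

print("r12     =", r12)
print("beta12.3 =", round(beta, 4))
print("r12.3    =", round(partial, 4))
```

Since $b_{12.3}$ is $\beta_{12.3}$ multiplied by the positive ratio $\sigma_1/\sigma_2$, a negative $\beta_{12.3}$ implies a negative $b_{12.3}$, even though the simple correlation $r_{12}$ is positive.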

If the net correlations between $X_1$ and $X_2$ and between $X_1$
and $X_3$ are both negative, the plane of regression of $X_1$ on $X_2$
and $X_3$ slopes downward in both directions. In this instance,
the $b_{12.3}$ and the $b_{13.2}$ of the regression equation are both negative.
In other words, the mean value of $X_1$ would tend to decrease
with increases in the values of both $X_2$ and $X_3$. This particular
plane of regression would have an all-round negative slope. If
the net correlation between $X_1$ and $X_3$ is negative, however, and
that between $X_1$ and $X_2$ is positive, the plane of regression of
$X_1$ on $X_2$ and $X_3$ slopes upward in the $X_2$ direction, that is, the
mean value of $X_1$ increases as $X_2$ increases; and the plane slopes
downward in the $X_3$ direction, that is, the mean value of $X_1$
declines as $X_3$ increases. In this instance, $b_{12.3}$ is positive,
and $b_{13.2}$ is negative. The plane of regression shows a positive
relationship in one direction and a negative relationship in the
other direction.

These are a few of the possible forms that a trivariate
frequency distribution might take. Others include nonlinear
relationships. For example, the mean value of $X_1$ might first
increase as $X_2$ increases and also as $X_3$ increases and then later
decrease as both these variables continued to increase, or $X_1$
might decline in the $X_2$ direction after a certain point but
continue to rise in the $X_3$ direction. For either of these
combinations, a curved plane or surface of regression would give a better
fit than a straight plane.

In order that there be all-round independence, that is to say,
absolutely no correlation whatsoever, either linear or nonlinear,
between any of the variables, the following conditions must exist:

1. The distribution of $X_1$ for any given $X_2$ and $X_3$ class
intervals, that is, the distribution of $X_1$ values for any given vertical
shaft, must be of the same form, that is, it must have the same mean,
the same standard deviation, etc., even though it does not have
the same number of cases, as the distribution of $X_1$ values for
every other vertical shaft of the trivariate frequency distribution
(see Fig. 117).

2. The distribution of $X_2$ values for any given $X_1$ and $X_3$
class interval, that is, the distribution of $X_2$ values in any given
horizontal shaft parallel to the $X_2$-axis and perpendicular to the
$X_1X_3$ plane, must be of the same form as the distribution of $X_2$
values in every other horizontal shaft parallel to the $X_2$-axis
(see Fig. 117).

3. The distribution of $X_3$ values for any given $X_1$ and $X_2$
class interval, that is, the distribution of $X_3$ values in any given
horizontal shaft parallel to the $X_3$-axis and perpendicular to
the $X_1X_2$ plane, must be of the same form as the distribution of
$X_3$ values in every other shaft parallel to the $X_3$-axis and
perpendicular to the $X_1X_2$ plane (see Fig. 117).

A close study of a multivariate frequency distribution is 
therefore always desirable before attempting to calculate any 
measure of relationship. Since in some instances the net corre- 
lation may be of opposite sign from simple linear correlation, as 
illustrated above, the examination of separate bivariate dis- 
tributions for each pair of variables is not always a reliable 
method. It is better to undertake a study of the multivariate 
distribution. In a trivariate problem a diagram similar to
Fig. 117 could be set up, but for a large number of class
intervals it would be extremely difficult, if not impossible, to draw.
The multivariate distribution can be studied, however, by
selecting all the $X_1$ and $X_2$ variates associated with a given range of
$X_3$ variates, for example, in Fig. 117, all the $X_1$ and $X_2$ variates
associated with values of $X_3$ from 5 to 10. In this manner, a
series of frequency distributions of $X_1$ for varying values of $X_2$
is obtained.


TABLE 46. VALUES OF $X_1$ AND $X_2$ ASSOCIATED WITH VALUES OF
$X_3$ BETWEEN 5 AND 10

[Table body not recoverable.]

Similar tables could be constructed showing the values of $X_1$
and $X_2$ associated with values of $X_3$ between 0 and 5 and with
values of $X_3$ between 10 and 15. In this manner, the net
correlation between $X_1$ and $X_2$ can be studied; and, by a similar
procedure, the net correlations between $X_1$ and $X_3$ and between
$X_2$ and $X_3$ can be examined. If such a study should reveal that
linear relationships prevail, the methods to be discussed in the 
ensuing sections could be applied. If simple curvilinear rela- 
tionships are apparent, some curved plane might better be fitted 
instead of a straight plane. In some instances, the latter could 
be accomplished by using logarithms, reciprocals, or some other 
transformation of the variables, to which linear functions could 
be fitted; in other instances, it might be necessary to fit parabolic 
functions to the original units. 


MULTIPLE LINEAR REGRESSION 


The $a$'s, $b$'s, and $\beta$'s of a linear plane of regression are
calculated in terms of given data or of quantities easily calculated
from the data. The $\beta$'s can be evaluated in terms of the simple
correlation coefficients, the $r$'s; knowledge of these therefore
permits the immediate calculation of the former. The $b$'s can
be computed readily from the $\beta$'s by multiplying by the proper
ratio of standard deviations [see Eqs. (4)]. Finally, the $a$'s can
be computed from the $b$'s and the means of the different variables.

The common method of evaluating the $\beta$'s is the method of
least squares. It was pointed out that for three variables there
are three planes of regression. Values for the regression
statistics in the regression equation of $X_1$ on $X_2$ and $X_3$ are derived
by minimizing the sum of the squares of the deviations of the
actual values of $X_1$ from those ($X_1'$) given by the plane of
regression, that is, by minimizing $\sum(X_1 - X_1')^2$. Similarly, values for
the regression statistics in the regression equation of $X_2$ on $X_1$
and $X_3$ are derived by minimizing the sum of the squares of the
deviations of $X_2$ from those ($X_2'$) given by the second plane of
regression, that is, by minimizing $\sum(X_2 - X_2')^2$. Finally, the
values of the regression statistics in the regression equation of
$X_3$ on $X_1$ and $X_2$ are derived by minimizing $\sum(X_3 - X_3')^2$. All
three planes of regression are thus fitted by the method of least
squares, but in each case the sum of the squares of a different
set of deviations is minimized.

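Using the third form of regression equation, the values of the
statistics for the plane of regression of $X_1$ on $X_2$ and $X_3$ are
derived as follows:

In the equation for the plane of regression,

$$\frac{x_1'}{\sigma_1} = \beta_{12.3}\frac{x_2}{\sigma_2} + \beta_{13.2}\frac{x_3}{\sigma_3}$$

the problem is to determine $\beta_{12.3}$ and $\beta_{13.2}$ such that

$$\sum\left(\frac{x_1}{\sigma_1} - \beta_{12.3}\frac{x_2}{\sigma_2} - \beta_{13.2}\frac{x_3}{\sigma_3}\right)^2 = \text{minimum} \qquad (5)$$

Since $\sum\left(\frac{x_1}{\sigma_1} - \frac{x_1'}{\sigma_1}\right)^2$ is merely

$$\frac{1}{\sigma_1^2}\sum(x_1 - x_1')^2 = \frac{1}{\sigma_1^2}\sum(X_1 - X_1')^2$$

it follows that $\sum(X_1 - X_1')^2$ will be a minimum when

$$\sum\left(\frac{x_1}{\sigma_1} - \frac{x_1'}{\sigma_1}\right)^2$$

is a minimum, and hence the plane of regression derived by
minimizing the latter is the same as that derived by minimizing
the former.

If $\sum\left(\frac{x_1}{\sigma_1} - \beta_{12.3}\frac{x_2}{\sigma_2} - \beta_{13.2}\frac{x_3}{\sigma_3}\right)^2$ is to be a minimum, the
derivative of this sum with respect to $\beta_{12.3}$ must equal zero and also its
derivative with respect to $\beta_{13.2}$ must equal zero. These
conditions are expressed in the following equations:

$$\sum\frac{x_2}{\sigma_2}\left(\frac{x_1}{\sigma_1} - \beta_{12.3}\frac{x_2}{\sigma_2} - \beta_{13.2}\frac{x_3}{\sigma_3}\right) = 0$$
$$\sum\frac{x_3}{\sigma_3}\left(\frac{x_1}{\sigma_1} - \beta_{12.3}\frac{x_2}{\sigma_2} - \beta_{13.2}\frac{x_3}{\sigma_3}\right) = 0 \qquad (6)$$

If in these equations the indicated multiplication is carried out,
and if each equation is divided by $N$, they become

$$\frac{\sum x_1x_2}{N\sigma_1\sigma_2} - \beta_{12.3}\frac{\sum x_2^2}{N\sigma_2^2} - \beta_{13.2}\frac{\sum x_2x_3}{N\sigma_2\sigma_3} = 0$$
$$\frac{\sum x_1x_3}{N\sigma_1\sigma_3} - \beta_{12.3}\frac{\sum x_2x_3}{N\sigma_2\sigma_3} - \beta_{13.2}\frac{\sum x_3^2}{N\sigma_3^2} = 0 \qquad (7)$$

But it will be noted that $\frac{\sum x_1x_2}{N\sigma_1\sigma_2} = r_{12}$, $\frac{\sum x_1x_3}{N\sigma_1\sigma_3} = r_{13}$, $\frac{\sum x_2x_3}{N\sigma_2\sigma_3} = r_{23}$,
$\frac{\sum x_2^2}{N\sigma_2^2} = 1$, and $\frac{\sum x_3^2}{N\sigma_3^2} = 1$. Hence Eqs. (7) reduce to the following:

$$r_{12} - \beta_{12.3} - \beta_{13.2}r_{23} = 0$$
$$r_{13} - \beta_{12.3}r_{23} - \beta_{13.2} = 0 \qquad (8)$$

When solved by the ordinary method of simultaneous equations,

$$\beta_{12.3} = \frac{r_{12} - r_{13}r_{23}}{1 - r_{23}^2}$$
$$\beta_{13.2} = \frac{r_{13} - r_{12}r_{23}}{1 - r_{23}^2} \qquad (9)$$

From Eqs. (9) it will be noted that, when $r_{23} = 0$, $\beta_{12.3} = r_{12}$
and $\beta_{13.2} = r_{13}$.*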
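As a check on the algebra, Eqs. (8) may be solved numerically as a pair of simultaneous equations and the result compared with the closed form of Eqs. (9). The following is a minimal sketch; the correlation coefficients are hypothetical, and the function names are introduced here only for illustration.

```python
import numpy as np


def betas_from_normal_equations(r12, r13, r23):
    """Solve Eqs. (8) as a 2 x 2 linear system for beta12.3 and beta13.2.

    Eqs. (8) rearranged:   beta12.3 + r23 * beta13.2 = r12
                     r23 * beta12.3 +       beta13.2 = r13
    """
    coeffs = np.array([[1.0, r23], [r23, 1.0]])
    rhs = np.array([r12, r13])
    return np.linalg.solve(coeffs, rhs)


def betas_closed_form(r12, r13, r23):
    """Closed-form solution, Eqs. (9)."""
    d = 1.0 - r23 ** 2
    return np.array([(r12 - r13 * r23) / d, (r13 - r12 * r23) / d])


# Hypothetical zero-order correlation coefficients.
r12, r13, r23 = 0.6, 0.5, 0.4

solved = betas_from_normal_equations(r12, r13, r23)
closed = betas_closed_form(r12, r13, r23)
print(solved, closed)

# When r23 = 0 the beta's reduce to the simple r's.
uncorrelated = betas_closed_form(r12, r13, 0.0)
print(uncorrelated)
```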
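If the other planes of regression are put in the form

$$\frac{x_2'}{\sigma_2} = \beta_{21.3}\frac{x_1}{\sigma_1} + \beta_{23.1}\frac{x_3}{\sigma_3},$$

$$\frac{x_3'}{\sigma_3} = \beta_{31.2}\frac{x_1}{\sigma_1} + \beta_{32.1}\frac{x_2}{\sigma_2}$$

and the values of the $\beta$'s are determined in the same manner as
the values of $\beta_{12.3}$ and $\beta_{13.2}$ were determined, the following results
are obtained:

$$\beta_{21.3} = \frac{r_{12} - r_{13}r_{23}}{1 - r_{13}^2}, \qquad \beta_{23.1} = \frac{r_{23} - r_{12}r_{13}}{1 - r_{13}^2} \qquad (9')$$

$$\beta_{31.2} = \frac{r_{13} - r_{12}r_{23}}{1 - r_{12}^2}, \qquad \beta_{32.1} = \frac{r_{23} - r_{12}r_{13}}{1 - r_{12}^2} \qquad (9'')$$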


If the simple linear-correlation coefficients are known,
therefore, it is possible to obtain all the $\beta$'s that enter into the three
multiple-regression equations; and the regression equations in the
$\beta$ form are thus determined. The other forms of the regression
equations can be derived from the $\beta$ form by calculating the
$b$'s from the $\beta$'s, and the $a$'s from the $b$'s and the means. For
example, Eqs. (4), such as $b_{12.3} = \beta_{12.3}\dfrac{\sigma_1}{\sigma_2}$, will give the values of
the $b$'s. The regression equations in the form

$$x_1' = b_{12.3}x_2 + b_{13.2}x_3$$

are then determined. If, in this latter form, $X_1' - \bar{X}_1$ is
substituted for $x_1'$, $X_2 - \bar{X}_2$ for $x_2$, and $X_3 - \bar{X}_3$ for $x_3$, the equation
becomes

$$X_1' = \bar{X}_1 - b_{12.3}\bar{X}_2 - b_{13.2}\bar{X}_3 + b_{12.3}X_2 + b_{13.2}X_3$$

from which it may be seen that the value of $a_{1.23}$ is as follows:

$$a_{1.23} = \bar{X}_1 - b_{12.3}\bar{X}_2 - b_{13.2}\bar{X}_3 \qquad (10)$$

Similarly, the value of $a$ for the other regression equations is
found to be as follows:

$$a_{2.13} = \bar{X}_2 - b_{21.3}\bar{X}_1 - b_{23.1}\bar{X}_3 \qquad (10')$$
$$a_{3.12} = \bar{X}_3 - b_{31.2}\bar{X}_1 - b_{32.1}\bar{X}_2 \qquad (10'')$$

It is helpful in the use of these equations to remember the
symmetry in the notation, that is, the symmetry in the position
of the subscripts.

* See pp. 417 and 421.
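The chain of calculation from the $r$'s to the $\beta$'s, from the $\beta$'s to the $b$'s, and from the $b$'s to $a_{1.23}$ can be sketched as follows. The data are artificial values generated only for illustration; the function `plane_from_correlations` is a name assumed here, and the comparison with a direct least-squares fit by `numpy.linalg.lstsq` is added as a check and is not part of the text's procedure.

```python
import numpy as np

# Artificial data, generated only for illustration.
rng = np.random.default_rng(0)
n = 500
x2 = rng.normal(50, 10, n)
x3 = 0.5 * x2 + rng.normal(20, 5, n)
x1 = 3 + 0.8 * x2 - 0.4 * x3 + rng.normal(0, 4, n)


def plane_from_correlations(x1, x2, x3):
    """Return a_1.23, b_12.3, b_13.2 computed by way of the r's and beta's."""
    r = np.corrcoef([x1, x2, x3])
    r12, r13, r23 = r[0, 1], r[0, 2], r[1, 2]
    s1, s2, s3 = x1.std(), x2.std(), x3.std()      # standard deviations, divisor N
    beta12_3 = (r12 - r13 * r23) / (1 - r23 ** 2)  # Eqs. (9)
    beta13_2 = (r13 - r12 * r23) / (1 - r23 ** 2)
    b12_3 = beta12_3 * s1 / s2                     # Eqs. (4)
    b13_2 = beta13_2 * s1 / s3
    a1_23 = x1.mean() - b12_3 * x2.mean() - b13_2 * x3.mean()  # Eq. (10)
    return a1_23, b12_3, b13_2


a, b2, b3 = plane_from_correlations(x1, x2, x3)

# Check: fit the same plane directly by least squares.
design = np.column_stack([np.ones(n), x2, x3])
lstsq_coeffs, *_ = np.linalg.lstsq(design, x1, rcond=None)

print(a, b2, b3)
print(lstsq_coeffs)
```

The two sets of statistics agree to within rounding error, since both describe the same least-squares plane.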

Second-order Variances for Linear Plane of Regression. The
formulas derived below measure the dispersion of the individual
items about the plane of regression fitted by the method of
least squares. As in the simpler case of the line of regression, so
also for the plane of regression, the mathematical procedure
consists in finding the standard deviation of the deviations of actual
$X$ values from the estimated values ($X'$) given by the planes of
regression. For example, by definition, $\sigma_{1.23}^2 = \sum(X_1 - X_1')^2/N$.
The task reduces to one of evaluating such expressions in terms
of quantities already known, that is, the $r$'s and the $\sigma$'s. This
can be done as follows:

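Since it has been found easier to work with the variables when
they are converted into deviations from their respective means
and expressed in terms of their standard deviations, the formula
for $\sigma_{1.23}^2$ will first be put in that form. This can be done by
subtracting $\bar{X}_1$ from both $X_1$ and $X_1'$ and by multiplying both
numerator and denominator by $\sigma_1^2$, neither of which will affect
the value of the expression. Thus,

$$\sigma_{1.23}^2 = \frac{\sum(X_1 - X_1')^2}{N} = \frac{\sigma_1^2\sum\left[(X_1 - \bar{X}_1) - (X_1' - \bar{X}_1)\right]^2}{\sigma_1^2 N} = \frac{\sigma_1^2\sum\left(\dfrac{x_1}{\sigma_1} - \dfrac{x_1'}{\sigma_1}\right)^2}{N} = \frac{\sigma_1^2\sum d^2}{N} \qquad (11)$$

where

$$d = \frac{x_1}{\sigma_1} - \frac{x_1'}{\sigma_1}$$

The problem is to evaluate $\sum d^2$.

By Eqs. (3), the third form of the regression equation of $X_1$ on
$X_2$ and $X_3$, it follows that

$$d = \frac{x_1}{\sigma_1} - \frac{x_1'}{\sigma_1} = \frac{x_1}{\sigma_1} - \beta_{12.3}\frac{x_2}{\sigma_2} - \beta_{13.2}\frac{x_3}{\sigma_3} \qquad (12)$$

Accordingly, for any given set of values of $x_1$, $x_2$, and $x_3$, there
corresponds a particular value for $d$, which is the deviation of
the actual value of $x_1$ from the value of $x_1$ obtained by putting
the given values of $x_2$ and $x_3$ in the regression equation. There
are just as many $d$'s, therefore, as there are different sets of
values of $x_1$, $x_2$, and $x_3$. If any one of these $d$'s is squared,
Eq. (12) gives

$$d^2 = \frac{x_1 d}{\sigma_1} - \beta_{12.3}\frac{x_2 d}{\sigma_2} - \beta_{13.2}\frac{x_3 d}{\sigma_3} \qquad (13)$$

If all values of $d$ are squared and summed, the following
equation results:

$$\sum d^2 = \frac{\sum x_1 d}{\sigma_1} - \beta_{12.3}\frac{\sum x_2 d}{\sigma_2} - \beta_{13.2}\frac{\sum x_3 d}{\sigma_3} \qquad (14)$$

But from Eqs. (6) and (12), it will be noted that

$$\frac{\sum x_2 d}{\sigma_2} = \sum\frac{x_2}{\sigma_2}\left(\frac{x_1}{\sigma_1} - \beta_{12.3}\frac{x_2}{\sigma_2} - \beta_{13.2}\frac{x_3}{\sigma_3}\right) = 0$$

and

$$\frac{\sum x_3 d}{\sigma_3} = \sum\frac{x_3}{\sigma_3}\left(\frac{x_1}{\sigma_1} - \beta_{12.3}\frac{x_2}{\sigma_2} - \beta_{13.2}\frac{x_3}{\sigma_3}\right) = 0$$

Therefore,

$$\sum d^2 = \frac{\sum x_1 d}{\sigma_1} \qquad (15)$$

The evaluation, therefore, of $\dfrac{\sum x_1 d}{\sigma_1}$ will solve the problem.
This can be done as follows:

If each $d$, as shown in Eq. (12), is multiplied by the $x_1/\sigma_1$ to
which that $d$ belongs, the following result is obtained:

$$\frac{x_1 d}{\sigma_1} = \frac{x_1^2}{\sigma_1^2} - \beta_{12.3}\frac{x_1x_2}{\sigma_1\sigma_2} - \beta_{13.2}\frac{x_1x_3}{\sigma_1\sigma_3} \qquad (16)$$

and, summing,

$$\frac{\sum x_1 d}{\sigma_1} = \frac{\sum x_1^2}{\sigma_1^2} - \beta_{12.3}\frac{\sum x_1x_2}{\sigma_1\sigma_2} - \beta_{13.2}\frac{\sum x_1x_3}{\sigma_1\sigma_3} \qquad (17)$$

Hence, dividing by $N$,

$$\frac{\sum d^2}{N} = \frac{\sum x_1 d}{N\sigma_1} = \frac{\sum x_1^2}{N\sigma_1^2} - \beta_{12.3}\frac{\sum x_1x_2}{N\sigma_1\sigma_2} - \beta_{13.2}\frac{\sum x_1x_3}{N\sigma_1\sigma_3}$$

But since

$$\frac{\sum x_1^2}{N\sigma_1^2} = 1, \qquad \frac{\sum x_1x_2}{N\sigma_1\sigma_2} = r_{12}, \qquad \frac{\sum x_1x_3}{N\sigma_1\sigma_3} = r_{13}$$

it follows that

$$\frac{\sum d^2}{N} = 1 - \beta_{12.3}r_{12} - \beta_{13.2}r_{13} \qquad (18)$$

and, finally, from Eqs. (11) and (18),

$$\sigma_{1.23}^2 = \frac{\sigma_1^2\sum d^2}{N} = \sigma_1^2(1 - \beta_{12.3}r_{12} - \beta_{13.2}r_{13}) \qquad (19)$$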


This gives an easy method for evaluating $\sigma_{1.23}^2$ when the $r$'s
and $\beta$'s have been calculated. Similar formulas for evaluating
$\sigma_{2.13}^2$ and $\sigma_{3.12}^2$, the scatters about the other planes of regression,
are found to be as follows:¹

$$\sigma_{2.13}^2 = \sigma_2^2(1 - \beta_{21.3}r_{12} - \beta_{23.1}r_{23}) \qquad (19')$$
$$\sigma_{3.12}^2 = \sigma_3^2(1 - \beta_{31.2}r_{13} - \beta_{32.1}r_{23}) \qquad (19'')$$

Note the symmetry of these three equations.
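Equation (19) may be verified on a set of data by comparing it with the variance of the deviations measured directly from the fitted plane. The sketch below uses artificial data generated only for illustration; the direct fit by `numpy.linalg.lstsq` is an assumption of this check rather than a step in the derivation.

```python
import numpy as np

# Artificial data, generated only for illustration.
rng = np.random.default_rng(1)
n = 400
x2 = rng.normal(0, 2, n)
x3 = 0.6 * x2 + rng.normal(0, 1, n)
x1 = 1.5 * x2 + 0.7 * x3 + rng.normal(0, 3, n)

r = np.corrcoef([x1, x2, x3])
r12, r13, r23 = r[0, 1], r[0, 2], r[1, 2]
beta12_3 = (r12 - r13 * r23) / (1 - r23 ** 2)   # Eqs. (9)
beta13_2 = (r13 - r12 * r23) / (1 - r23 ** 2)

# Eq. (19): variance about the plane from the r's and beta's (divisor N).
var_formula = x1.var() * (1 - beta12_3 * r12 - beta13_2 * r13)

# Direct computation: fit the plane by least squares and take the
# mean of the squared deviations of X1 from the plane.
design = np.column_stack([np.ones(n), x2, x3])
coeffs, *_ = np.linalg.lstsq(design, x1, rcond=None)
residuals = x1 - design @ coeffs
var_direct = np.mean(residuals ** 2)

print(var_formula, var_direct)
```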

COEFFICIENT OF MULTIPLE CORRELATION

The multiple-correlation coefficient measures the correlation
between the dependent variable and the two independent
variables taken together. For reasons previously indicated,²

$$R_{1.23}^2 = 1 - \frac{\sigma_{1.23}^2}{\sigma_1^2}$$

may be taken as a good measure of multiple correlation.

¹ The dispersion $\sigma_{1.23}^2$ may also be calculated from the formulas:

$$\sigma_{1.23}^2 = \sigma_{1.2}^2(1 - r_{13.2}^2) \quad\text{and}\quad \sigma_{1.23}^2 = \sigma_1^2(1 - r_{12}^2)(1 - r_{13.2}^2)$$

This may be demonstrated as follows: From Eq. (23'),

$$r_{13.2}^2 = \frac{r_{13}^2 + r_{12}^2r_{23}^2 - 2r_{12}r_{13}r_{23}}{(1 - r_{12}^2)(1 - r_{23}^2)}$$

and

$$1 - r_{13.2}^2 = \frac{1 - r_{12}^2 - r_{23}^2 + r_{12}^2r_{23}^2 - r_{13}^2 - r_{12}^2r_{23}^2 + 2r_{12}r_{13}r_{23}}{(1 - r_{12}^2)(1 - r_{23}^2)}$$

This gives

$$(1 - r_{12}^2)(1 - r_{13.2}^2) = \frac{(1 - r_{23}^2) - (r_{12}^2 + r_{13}^2 - 2r_{12}r_{13}r_{23})}{1 - r_{23}^2} = 1 - r_{12}\frac{r_{12} - r_{13}r_{23}}{1 - r_{23}^2} - r_{13}\frac{r_{13} - r_{12}r_{23}}{1 - r_{23}^2}$$

Equation (9), however, shows that the two fractions on the right are equal
respectively to $\beta_{12.3}$ and $\beta_{13.2}$. Hence

$$(1 - r_{12}^2)(1 - r_{13.2}^2) = 1 - r_{12}\beta_{12.3} - r_{13}\beta_{13.2},$$

or, on making use of Eq. (19),

$$\sigma_{1.23}^2 = \sigma_1^2(1 - r_{12}^2)(1 - r_{13.2}^2)$$

In Chap. XIII (p. 321) it was shown that $\sigma_{1.2}^2 = \sigma_1^2(1 - r_{12}^2)$. Hence the
last equation may be written

$$\sigma_{1.23}^2 = \sigma_{1.2}^2(1 - r_{13.2}^2)$$

Thus both of the original formulas are derived from previously
demonstrated relationships. Similar formulas hold for $\sigma_{2.13}^2$ and $\sigma_{3.12}^2$. These are

$$\sigma_{2.13}^2 = \sigma_{2.1}^2(1 - r_{23.1}^2) = \sigma_2^2(1 - r_{12}^2)(1 - r_{23.1}^2)$$

and

$$\sigma_{3.12}^2 = \sigma_{3.1}^2(1 - r_{23.1}^2) = \sigma_3^2(1 - r_{13}^2)(1 - r_{23.1}^2)$$

² See pp. 351-353 and 365-368.

$R_{1.23}$ measures the degree of association between the $X_1$
variable and $X_2$ and $X_3$ taken jointly. It can also be looked
upon as a measure of the goodness of fit of the plane of regression
of $X_1$ on $X_2$ and $X_3$ to the set of $X_1$ values. For if the fit is
perfect, $\sigma_{1.23}^2$ will be zero and hence $R_{1.23}$ will equal 1.
Similarly, $R_{2.13}$ measures the degree of association between the $X_2$
variable and $X_1$ and $X_3$ taken jointly, and $R_{3.12}$ measures the
association between the $X_3$ variable and $X_1$ and $X_2$ taken jointly.
They also can be looked upon as measures of goodness of fit of
their respective planes of regression. It will be recalled that all
three of these multiple-correlation coefficients may have
different values.

The multiple coefficient of correlation $R_{1.23}$ is always larger
than or at least equal to $r_{12}$ and $r_{13}$ in absolute value; for it stands
to reason that $X_1$ can be estimated better (or at least no more poorly)
from two variables $X_2$ and $X_3$ than from $X_2$ alone or $X_3$ alone. Similarly,
$R_{2.13}$ is greater than, or at least equal to, $r_{12}$ and $r_{23}$; and $R_{3.12}$ is
greater than, or at least equal to, $r_{13}$ and $r_{23}$. Furthermore,
$R_{1.23}^2$ is equal to the sum of $r_{12}^2$ and $r_{13}^2$ if $X_2$ and $X_3$ are independent
of each other; for, by Eq. (19), it follows that

$$R_{1.23}^2 = 1 - \frac{\sigma_{1.23}^2}{\sigma_1^2} = 1 - \frac{\sigma_1^2(1 - \beta_{12.3}r_{12} - \beta_{13.2}r_{13})}{\sigma_1^2} = \beta_{12.3}r_{12} + \beta_{13.2}r_{13} \qquad (20)$$

If $X_2$ and $X_3$ are independent of each other, $r_{23} = 0$; and, by
Eqs. (9), $\beta_{12.3} = r_{12}$ and $\beta_{13.2} = r_{13}$. Accordingly, if $r_{23} = 0$,

$$R_{1.23}^2 = r_{12}^2 + r_{13}^2 \qquad (21)$$

Similarly, if $r_{13} = 0$,

$$R_{2.13}^2 = r_{12}^2 + r_{23}^2 \qquad (21')$$

and if $r_{12} = 0$,

$$R_{3.12}^2 = r_{13}^2 + r_{23}^2 \qquad (21'')$$


Consequently, by adding to the regression equation a second
variable that is independent of the first, the proportion of the
variance of the dependent variable that is accounted for, $R^2$, is
increased by the square of the correlation between the dependent
variable and the newly added variable.
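Equations (20) and (21) lend themselves to a brief numerical illustration. In the sketch below the correlation coefficients are hypothetical and the function name `multiple_r_squared` is assumed for the purpose; the first case takes $r_{23} = 0$, the second a positive $r_{23}$.

```python
import math


def multiple_r_squared(r12, r13, r23):
    """R^2_{1.23} from Eq. (20), with the beta's taken from Eqs. (9)."""
    d = 1 - r23 ** 2
    beta12_3 = (r12 - r13 * r23) / d
    beta13_2 = (r13 - r12 * r23) / d
    return beta12_3 * r12 + beta13_2 * r13


# Hypothetical correlations. With r23 = 0, Eq. (21) applies.
r_sq_independent = multiple_r_squared(0.5, 0.6, 0.0)

# The same r12 and r13, but with X2 and X3 correlated.
r_sq_correlated = multiple_r_squared(0.5, 0.6, 0.4)

print(r_sq_independent)            # equals 0.5**2 + 0.6**2
print(r_sq_correlated)
print(math.sqrt(r_sq_correlated))  # not less than either 0.5 or 0.6
```

In the second case $R_{1.23}^2$ falls short of $r_{12}^2 + r_{13}^2$, because part of what $X_3$ contributes is already supplied by $X_2$; $R_{1.23}$ is nevertheless not smaller than either simple coefficient.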

It should be noted that only in special instances can a definite
sign be given the multiple-correlation coefficient, although it is
usually assumed to be inherently positive. For, as was indicated
above, it may happen that the plane of regression to which a
given multiple-correlation coefficient pertains may slope upward
in one direction and downward in another direction, indicating
a positive relationship between the dependent variable and one
independent variable and a negative relationship between the
dependent variable and the other independent variable. In
such an instance, the correlation between the dependent variable
and the two independent variables taken jointly, that is, the
multiple correlation, cannot be said to be either positive or
negative. For such a multiple-correlation coefficient, no sign is
attached. It is only when the dependent variable is positively
or negatively correlated with each and every one of the
independent variables that the multiple-correlation coefficient can be
given a positive or negative sign.


COEFFICIENT OF PARTIAL CORRELATION 


In the preceding sections of this chapter the discussion has 
centered on the problem of estimating the value of one variable 
from one or more other variables by means of a regression equa- 
tion. In connection with this problem, a coefficient measuring 
the degree of association between the dependent variable and the 
independent variables as a group was evaluated to show the 
accuracy with which such estimates can be made. This 
coefficient is a measure of the goodness of fit of the plane of 
regression. 

When there are interrelationships among three or more 
variables, another problem appears. It often happens that an 
apparent relationship between two variables is in reality the 
result of their individual connection with a third variable that 
commonly affects them both. For example, it may be that the 
correlation between the price of cotton and the price of rayon 
is due largely to the correlation of each of them with the index 
of wholesale prices. In other words, the concomitant
movements in the prices of cotton and rayon may be due,
fundamentally, not to any direct relationship between these two
competing commodities, but primarily to their common
tendency to rise and fall with the general price level; they may be
joint effects of a common cause. Similarly, the concomitant
variations in first- and second-semester English grades of 
freshmen in a woman's college may be basically accounted for by 
their respective relationships to the grades attained by the same 
freshmen in verbal scholastic-aptitude tests or to their school 
records. 

The statistical device for discovering how much correlation
there is between one variable and another variable when a third
variable or a number of other variables are "held constant" is
the method of "partial correlation." The correlation between
the freshmen grades in second-semester English, $X_1$, and the
freshmen grades in first-semester English, $X_2$, when the grades
of the respective freshmen in verbal scholastic-aptitude tests, $X_3$,
are held constant is the partial correlation between $X_1$ and $X_2$.
Such a partial correlation coefficient will show how much
connection there is between grades in first- and second-semester
English independent of their common connection with grades
in verbal scholastic-aptitude tests. The coefficient of partial
correlation, indicated in this instance as $r_{12.3}$, will measure the
degree of this independent association.¹

A variable is, of course, not held constant in any physical 
sense. It is not possible in any way ex post facto to change the 
fact that a Mount Holyoke freshman, who had grades of 160 
in first-semester English and 160 in second-semester English, 
had also a grade of 437 in her verbal scholastic-aptitude test; 
nor is it possible to change the fact that another Mount Holyoke 
freshman, who had grades of 120 in first-semester English and 
160 in second-semester English, had also a grade of 384 in her 
verbal scholastic-aptitude test. The idea of holding constant
one of the three variables is wholly a statistical idea. It consists
in eliminating from each of the two variables between which
the partial correlation is sought the effect of the third variable.
More specifically, the line of regression of $X_1$ on $X_3$ is found, and
the deviations of the actual values of $X_1$ from those given by the
line of regression $X_1' = a_{13} + b_{13}X_3$ are determined. These
deviations from the line of regression represent the variation in
$X_1$ that is left over after the linear effect of $X_3$ is eliminated.
Similarly, the line of regression of $X_2$ on $X_3$ is computed, and
the deviations of the actual values of $X_2$ from those given by the
line of regression $X_2' = a_{23} + b_{23}X_3$ are determined. These
deviations from the line of regression represent the variation in
$X_2$ that is left over after the linear effect of $X_3$ is eliminated.
When these residual deviations in $X_1$ and $X_2$ are correlated, the
result is the partial-correlation coefficient between $X_1$ and $X_2$
when $X_3$ is held constant, because the effect of $X_3$ upon each of
them has been eliminated.

¹ The position of the point in the subscripts of $r_{12.3}$, rather than the fact
that it is a smaller letter, distinguishes it from $R_{1.23}$. In the latter, the
point comes after the first digit, setting off the two independent variables
$X_2$ and $X_3$, jointly associated with the dependent variable $X_1$. In the
coefficient of partial correlation, the point sets off the variable that is held
constant coming immediately after the pair that are correlated, $X_1$ and
$X_2$. Thus, in $r_{12.3}$, $X_3$ is held constant while $X_1$ and $X_2$ are correlated;
in $r_{12.345}$, $X_3$, $X_4$, and $X_5$ are held constant while $X_1$ and $X_2$ are correlated.
The symbol $R_{1.2345}$, by the position of the point, indicates a
multiple-correlation coefficient between $X_1$, dependent, and $X_2$, $X_3$, $X_4$, and $X_5$ taken
jointly as the independent variables.

To calculate a partial coefficient of correlation, the extended
calculations involved in computing two lines of regression and
measuring the deviations of the actual values from them are not
necessary. The coefficient of partial correlation can be
algebraically evaluated in a formula that makes it possible to compute
it from the coefficients of simple linear correlation, as follows:

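The deviation of $X_1$ from the line of regression of $X_1$ on $X_3$
may be written as $x_1 - x_1'$ or $x_1 - r_{13}\dfrac{\sigma_1}{\sigma_3}x_3$, where the $x$'s are
measured from their respective means. Similarly, the deviation
of $X_2$ from the line of regression of $X_2$ on $X_3$ may be written as
$x_2 - x_2'$ or $x_2 - r_{23}\dfrac{\sigma_2}{\sigma_3}x_3$. The standard deviations of these
deviations from the lines of regression have already been
determined to be $\sigma_{1.3}$ and $\sigma_{2.3}$, respectively. In accordance with the
ordinary formula for a simple correlation coefficient, that is,
$r = \sum x_1x_2/N\sigma_1\sigma_2$, the partial-correlation coefficient between $X_1$
and $X_2$, when $X_3$ is held constant, is, by definition,

$$r_{12.3} = \frac{\sum(x_1 - x_1')(x_2 - x_2')}{N\sigma_{1.3}\sigma_{2.3}} = \frac{\sum\left(x_1 - r_{13}\dfrac{\sigma_1}{\sigma_3}x_3\right)\left(x_2 - r_{23}\dfrac{\sigma_2}{\sigma_3}x_3\right)}{N\sigma_{1.3}\sigma_{2.3}} \qquad (22)$$

If the numerator is expanded and the values for $\sigma_{1.3}$ and $\sigma_{2.3}$
are substituted in the denominator, this becomes

$$r_{12.3} = \frac{\sum x_1x_2 - r_{23}\dfrac{\sigma_2}{\sigma_3}\sum x_1x_3 - r_{13}\dfrac{\sigma_1}{\sigma_3}\sum x_2x_3 + r_{13}r_{23}\dfrac{\sigma_1\sigma_2}{\sigma_3^2}\sum x_3^2}{N\sigma_1\sqrt{1 - r_{13}^2}\,\sigma_2\sqrt{1 - r_{23}^2}}$$

Upon transferring the divisor $N\sigma_1\sigma_2$ from the denominator to
each term of the numerator, this becomes

$$r_{12.3} = \frac{\dfrac{\sum x_1x_2}{N\sigma_1\sigma_2} - r_{23}\dfrac{\sum x_1x_3}{N\sigma_1\sigma_3} - r_{13}\dfrac{\sum x_2x_3}{N\sigma_2\sigma_3} + r_{13}r_{23}\dfrac{\sum x_3^2}{N\sigma_3^2}}{\sqrt{1 - r_{13}^2}\sqrt{1 - r_{23}^2}}$$

in which $r_{12}$, $r_{13}$, $r_{23}$ can be substituted for their respective
equivalent values, making the formula appear as follows:

$$r_{12.3} = \frac{r_{12} - r_{23}r_{13} - r_{13}r_{23} + r_{13}r_{23}}{\sqrt{1 - r_{13}^2}\sqrt{1 - r_{23}^2}}$$

which reduces to

$$r_{12.3} = \frac{r_{12} - r_{13}r_{23}}{\sqrt{1 - r_{13}^2}\sqrt{1 - r_{23}^2}} \qquad (23)$$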


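Similar formulas for the partial correlation between $X_1$ and
$X_3$ when $X_2$ is held constant and the partial correlation between
$X_2$ and $X_3$ when $X_1$ is held constant are as follows:¹

$$r_{13.2} = \frac{r_{13} - r_{12}r_{23}}{\sqrt{1 - r_{12}^2}\sqrt{1 - r_{23}^2}} \qquad (23')$$
$$r_{23.1} = \frac{r_{23} - r_{12}r_{13}}{\sqrt{1 - r_{12}^2}\sqrt{1 - r_{13}^2}} \qquad (23'')$$

From Eq. (23) it can be seen that if $X_1$ and $X_2$ are both
uncorrelated with $X_3$, that is, if $r_{13}$ and $r_{23}$ are zero, then $r_{12.3} = r_{12}$.
Similarly, if $r_{12}$ and $r_{23}$ are zero, $r_{13.2} = r_{13}$; and if $r_{12}$ and $r_{13}$
are zero, $r_{23.1} = r_{23}$.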
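That Eq. (23) gives the same result as the longer procedure of correlating the deviations from the two lines of regression may be confirmed on a set of data. The following sketch uses artificial data in which $X_1$ and $X_2$ are related only through $X_3$; the data and the helper `residuals_on` are assumptions introduced for illustration.

```python
import numpy as np

# Artificial data: X1 and X2 are each related to X3 but not directly
# to each other.
rng = np.random.default_rng(2)
n = 300
x3 = rng.normal(0, 1, n)
x1 = 0.8 * x3 + rng.normal(0, 1, n)
x2 = 0.7 * x3 + rng.normal(0, 1, n)


def residuals_on(y, x):
    """Deviations of y from its least-squares line of regression on x."""
    b = np.cov(y, x, bias=True)[0, 1] / x.var()
    a = y.mean() - b * x.mean()
    return y - (a + b * x)


r = np.corrcoef([x1, x2, x3])
r12, r13, r23 = r[0, 1], r[0, 2], r[1, 2]

# Eq. (23): partial correlation from the zero-order coefficients.
partial_formula = (r12 - r13 * r23) / np.sqrt((1 - r13 ** 2) * (1 - r23 ** 2))

# Longer method: correlate the residual deviations of X1 and X2 after
# the linear effect of X3 has been eliminated from each.
partial_residuals = np.corrcoef(residuals_on(x1, x3), residuals_on(x2, x3))[0, 1]

print(r12, partial_formula, partial_residuals)
```

The two computations of $r_{12.3}$ agree to within rounding error, whatever the data.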


¹ For the coefficient of partial correlation, the order of the numbers in the
subscripts either before or after the point is a matter of indifference; that
is, $r_{12.3} = r_{21.3}$ and $r_{12.345} = r_{21.435}$, etc. It will be remembered that this is
not true with respect to the order of the numbers in the subscripts before
the decimal in the $b$'s and the $\beta$'s; that is, $b_{12.3} \neq b_{21.3}$, $\beta_{12.3} \neq \beta_{21.3}$.


Any one of the following formulas, which can be readily
derived algebraically [see Eq. (9)], may be used in place of, or as
a check on, Eq. (23):

$$r_{12.3} = \sqrt{\beta_{12.3}\beta_{21.3}} \qquad (24)$$

$$r_{12.3} = \beta_{12.3}\frac{\sqrt{1 - r_{23}^2}}{\sqrt{1 - r_{13}^2}} \qquad (25)$$

$$r_{12.3} = \beta_{12.3}\frac{\sigma_1\sigma_{2.3}}{\sigma_2\sigma_{1.3}} \qquad (26)$$

$$r_{12.3} = b_{12.3}\frac{\sigma_{2.3}}{\sigma_{1.3}} \qquad (27)$$

Thus the partial-correlation coefficient can be calculated directly
from the $\beta$'s or from the $b$'s and the dispersion formulas for the
simple lines of regression, as well as from the simple $r$'s. The
equations for the other coefficients of partial correlation are
symmetrical with Eqs. (24) to (27).

The partial coefficients of correlation illustrated above are
called correlation coefficients of the "first order," while the
simple coefficients $r_{12}$, $r_{13}$, etc., are called "zero-order" correlation
coefficients. If there are more than three variables involved so
that a partial coefficient of correlation, $r_{12.34}$, for example, is
found, it is called a "second-order" coefficient of correlation;
similarly, $r_{12.345}$ is a "third-order" coefficient of correlation, etc.
This classification is helpful in distinguishing different sets of
correlation coefficients. The same terminology may be
conveniently carried over to the other statistics in a correlation
problem. Thus, $\sigma_1$ is a zero-order standard deviation, $\sigma_{1.2}$ is a
first-order standard deviation, etc.; $b_{12.3}$ is a first-order regression
statistic, $b_{12.34}$ is a second-order regression statistic; etc.


ANALYSIS OF VARIANCE IN MULTIPLE CORRELATION 


When a plane of regression, for example, $X_1$ on $X_2$ and $X_3$,
is fitted to a trivariate frequency distribution by the method of
least squares, variation in $X_1$ may be viewed as made up of a
part that is due to its linear association with $X_2$, a second part
that is due to its linear association with $X_3$, and a third part
that is due to association with factors independent of both $X_2$
and $X_3$. For the least-squares Eqs. (6) show that
$\sum x_2d = 0$ and $\sum x_3d = 0$, which means that neither $X_2$ nor $X_3$
is linearly correlated with deviations from the plane of
regression. In the case of a normal trivariate frequency distribution,
the independent variables are not correlated in any way with
the deviations from the plane of regression. Owing to the lack
of correlation, the variance in the dependent variable is equal to
the variance of the values given by the plane of regression plus
the variance of the deviations from the plane. This may be
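shown as follows:

$$x_1 = x_1' + (x_1 - x_1')$$

and

$$x_1^2 = (x_1')^2 + 2x_1'(x_1 - x_1') + (x_1 - x_1')^2 \qquad (28)$$

If all the deviations squared, like $x_1^2$, described in Eq. (28) are
added, the following result is obtained:

$$\sum x_1^2 = \sum(x_1')^2 + \sum(x_1 - x_1')^2 + 2\sum x_1'(x_1 - x_1')$$

or, by substituting $x_1' = \left(\beta_{12.3}\dfrac{x_2}{\sigma_2} + \beta_{13.2}\dfrac{x_3}{\sigma_3}\right)\sigma_1$ for the last $x_1'$,

$$\sum x_1^2 = \sum(x_1')^2 + \sum(x_1 - x_1')^2 + 2\beta_{12.3}\frac{\sigma_1}{\sigma_2}\sum x_2(x_1 - x_1') + 2\beta_{13.2}\frac{\sigma_1}{\sigma_3}\sum x_3(x_1 - x_1')$$

But, as just stated, the deviations $x_1 - x_1'$ are not linearly
correlated with $x_2$ and $x_3$, so that the cross-product terms are zero.
Therefore,

$$\sum x_1^2 = \sum(x_1')^2 + \sum(x_1 - x_1')^2$$

If each term is divided by $N$, it is found that

$$\sigma_1^2 = \sigma_{1'}^2 + \sigma_{1.23}^2 \qquad (29)$$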


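In Eq. (29), $\sigma_{1'}^2$ may be further evaluated, as follows:¹

$$\sum(x_1')^2 = \sum\left(\beta_{12.3}\frac{x_2}{\sigma_2} + \beta_{13.2}\frac{x_3}{\sigma_3}\right)^2\sigma_1^2 = \sigma_1^2\left(\beta_{12.3}^2\frac{\sum x_2^2}{\sigma_2^2} + \beta_{13.2}^2\frac{\sum x_3^2}{\sigma_3^2} + 2\beta_{12.3}\beta_{13.2}\frac{\sum x_2x_3}{\sigma_2\sigma_3}\right)$$

¹ Since $x_1' = \left(\beta_{12.3}\dfrac{x_2}{\sigma_2} + \beta_{13.2}\dfrac{x_3}{\sigma_3}\right)\sigma_1$. See p. 411.

If the above is divided by $N$ and $\sigma_{1'}^2$, $\sigma_2^2$, $\sigma_3^2$, and $r_{23}$ are substituted
for equivalent values, the expression becomes

$$\sigma_{1'}^2 = \sigma_1^2(\beta_{12.3}^2 + \beta_{13.2}^2 + 2r_{23}\beta_{12.3}\beta_{13.2}) \qquad (30)$$

By substituting this value for $\sigma_{1'}^2$ in Eq. (29), it is found that

$$\sigma_1^2 = \sigma_1^2(\beta_{12.3}^2 + \beta_{13.2}^2 + 2r_{23}\beta_{12.3}\beta_{13.2}) + \sigma_{1.23}^2$$

or

$$1 = \beta_{12.3}^2 + \beta_{13.2}^2 + 2r_{23}\beta_{12.3}\beta_{13.2} + \frac{\sigma_{1.23}^2}{\sigma_1^2} \qquad (31)$$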


From the manner in which Eq. (31) was derived, it is known
that the terms on the right side each represent a percentage of
the total variance of $X_1$. This may be interpreted in the
following manner:

FIG. 118. Illustration of coefficients of direct determination, meaning of $2r$-$\beta$
cross-product term and the residual variance.

Coefficients of Direct Determination. The first term of the
right side of Eq. (31), $\beta_{12.3}^2$, may be interpreted as the percentage
of the total variation in $X_1$ that is due to its direct association
with $X_2$. It has consequently been called the "coefficient of
direct determination" of $X_1$ by $X_2$. Similarly, $\beta_{13.2}^2$ may be
interpreted as the percentage of the total variation in $X_1$ that
is due to its direct association with $X_3$. Figure 118 depicts the
coefficients of direct determination by arrows pointing from $X_2$
directly to $X_1$ and from $X_3$ directly to $X_1$.

Coefficients of Net Regression. The beta unsquared, $\beta_{12.3}$, describes the change in $X_1$ in standard-deviation units that accompanies a given change of $X_2$ in standard-deviation units, when $X_3$ is constant. Geometrically, $\beta_{12.3}$ is the slope of the line of intersection of a plane perpendicular to the $x_3/\sigma_3$ axis with the plane of regression,

\[ \frac{x_1'}{\sigma_1} = \beta_{12.3}\frac{x_2}{\sigma_2} + \beta_{13.2}\frac{x_3}{\sigma_3} \]

The statistic $\beta_{12.3}$ has been called the “coefficient of net regression” of $X_1$ on $X_2$ in standard-deviation units. The coefficient of net regression in original units is $b_{12.3}$.

Coefficient of Joint Determination. The term $2r_{23}\beta_{12.3}\beta_{13.2}$ may be taken as representing the percentage of the total variation in $X_1$ that is due to the joint or combined effect of $X_2$ and $X_3$ resulting from the correlation between these two variables. In Fig. 118 the influence of $X_2$ on $X_1$ through its correlation with $X_3$ is depicted by the line $aa'$; the influence of $X_3$ on $X_1$ through its correlation with $X_2$ is depicted by the line $bb'$. Relationships along these lines indicate the significance of the $r$-$\beta$ cross-product term of Eq. (31). While variation in $X_2$ may directly affect $X_1$, it may also, through its correlation with $X_3$, bring about a change in $X_3$ and hence cause further variation in $X_1$ resulting from the connection between $X_3$ and $X_1$. Similarly, variation in $X_3$ may affect $X_1$, not only directly, but also indirectly through the association of $X_3$ and $X_2$. The term $2r_{23}\beta_{12.3}\beta_{13.2}$ may be taken to represent the combined indirect variation in $X_1$ resulting from variations in $X_2$ and $X_3$.

Meaning of Residual Variance. The portion of variance in $X_1$ due directly to $X_2$ is $\beta_{12.3}^2$; the portion due directly to $X_3$ is $\beta_{13.2}^2$; the portion due to the joint influence of $X_2$ and $X_3$ is the $2r$-$\beta$ cross-product term; the remainder of the variance is due to other factors not linearly correlated with $X_2$ and $X_3$. This is depicted in Fig. 118. The sum of all of these four terms is equal to the total variance, or, expressed as a proportion, to 1. The sum of the first three terms may be interpreted as the total portion of variance in $X_1$ that is due to its association with $X_2$ and $X_3$ jointly; the final term $\sigma_{1.23}^2/\sigma_1^2$ represents the portion of the variance in $X_1$ that is due to its association with other factors not linearly correlated with $X_2$ and $X_3$. The sum of all these portions of the total variance of $X_1$ is necessarily equal to 1.

The Coefficient of Multiple Correlation. From the previous discussion regarding the interpretation of the simple correlation coefficient and the correlation ratio, it is to be expected that a similar interpretation might be made of the multiple-correlation coefficient, that is, that $R_{1.23}^2$ represents the portion of the total variance in $X_1$ which is due to its joint association with $X_2$ and $X_3$. Equation (31) shows that this is actually the case; for it will be recalled that

\[ R_{1.23}^2 = 1 - \frac{\sigma_{1.23}^2}{\sigma_1^2} \]

Therefore, it follows by Eqs. (29), (30), and (31) that

\[ R_{1.23}^2 = \beta_{12.3}^2 + \beta_{13.2}^2 + 2r_{23}\beta_{12.3}\beta_{13.2} = \frac{\sigma_{x_1'}^2}{\sigma_1^2} \tag{32} \]

This interpretation of $R_{1.23}^2$ is similar to that previously made of


Th = P and "is = mi 
where oz, is the standard deviation of the line of regression. 

It has been noted that $R_{1.23}^2$ can be interpreted as the portion of the total variance of $X_1$ that may be attributed to its joint association with $X_2$ and $X_3$. It has also been shown that $R_{1.23}^2 = \beta_{12.3}r_{12} + \beta_{13.2}r_{13}$ [see Eq. (20)]. Consequently, it is possible to view $\beta_{12.3}r_{12}$ as the portion of the variance of $X_1$ that is due to its total association, both direct and indirect, with $X_2$ and also to view $\beta_{13.2}r_{13}$ as the portion of the variance of $X_1$ that is due to its total association, both direct and indirect, with $X_3$. Therefore, these two products have been called “coefficients of total determination” of $X_1$ by $X_2$ and $X_3$.
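
The two ways of building up $R_{1.23}^2$ can be compared numerically. The following is a minimal sketch, not part of the original text: it assumes the zero-order correlations reported for the grades example (Table 53), computes the first-order $\beta$'s by the formula of Eq. (9), and then forms the direct, joint, and total determination terms. All function and variable names are illustrative.

```python
# Zero-order correlations from the grades example (Table 53).
r12, r13, r23 = 0.89576, 0.62894, 0.59373


def beta(r_ij, r_ik, r_jk):
    """First-order beta_ij.k = (r_ij - r_ik * r_jk) / (1 - r_jk**2), as in Eq. (9)."""
    return (r_ij - r_ik * r_jk) / (1.0 - r_jk ** 2)


b12_3 = beta(r12, r13, r23)   # net regression of X1 on X2 (standard units)
b13_2 = beta(r13, r12, r23)   # net regression of X1 on X3 (standard units)

direct_x2 = b12_3 ** 2                   # coefficient of direct determination by X2
direct_x3 = b13_2 ** 2                   # coefficient of direct determination by X3
joint = 2.0 * r23 * b12_3 * b13_2        # 2r-beta cross-product term

r2_from_parts = direct_x2 + direct_x3 + joint      # Eq. (32)
r2_from_totals = b12_3 * r12 + b13_2 * r13         # coefficients of total determination
residual_share = 1.0 - r2_from_parts               # sigma_1.23^2 / sigma_1^2 in Eq. (31)

print(f"beta_12.3 = {b12_3:.5f}, beta_13.2 = {b13_2:.5f}")
print(f"direct X2 = {direct_x2:.5f}, direct X3 = {direct_x3:.5f}, joint = {joint:.5f}")
print(f"R^2 from parts = {r2_from_parts:.5f}, from totals = {r2_from_totals:.5f}")
print(f"residual share = {residual_share:.5f}")
```

Both routes should agree except for rounding, since they are algebraically the same quantity; small differences from the hand-computed work sheet arise only from rounding at intermediate steps.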

When either of the products is negative, however,¹ it is preferable to resolve the expression into its equal, namely,

\[ \beta_{12.3}^2 + \beta_{13.2}^2 + 2r_{23}\beta_{12.3}\beta_{13.2} \]

because

\[ R_{1.23}^2 = \beta_{12.3}r_{12} + \beta_{13.2}r_{13} = \beta_{12.3}^2 + \beta_{13.2}^2 + 2r_{23}\beta_{12.3}\beta_{13.2} \]

¹ The interpretation of the products as coefficients of total determination runs into difficulties in any particular case if either of them is negative. Either $\beta_{12.3}r_{12}$ or $\beta_{13.2}r_{13}$ but not both may be negative, since their sum equals $R_{1.23}^2$ which, of course, is not negative. To say that a variable contributes a negative percentage to the total variance of $X_1$ has no meaning, and consequently when either term is negative the interpretation is imaginary.


Expressed in the $\beta^2$ and cross-product form, it is easier to understand why a negative value of $\beta_{12.3}r_{12}$ or of $\beta_{13.2}r_{13}$ (but not of both) may occur; for whenever either $\beta_{12.3}r_{12}$ or $\beta_{13.2}r_{13}$ is negative, it will be found that the joint contribution of $X_2$ and $X_3$ to the variance of $X_1$, represented by $2r_{23}\beta_{12.3}\beta_{13.2}$, is also negative. This follows because either $r_{23}$ is negative or $\beta_{12.3}$ and $\beta_{13.2}$ are of opposite signs. In such a case, the direct effect of $X_2$, for example, on the variation of $X_1$ would be opposite in sign to its indirect effect, that is, through its correlation with $X_3$; and the existence of this indirect link to $X_1$ through $X_3$ would tend to diminish the total variation caused in $X_1$ by changes in $X_2$. This dampening effect of a negative value for the $r$-$\beta$ cross-product term is what explains the existence of a negative value for either $\beta_{12.3}r_{12}$ or $\beta_{13.2}r_{13}$. The form where they are squared also shows the difficulty in trying to assign to the variables $X_2$ and $X_3$ an independent part in accounting for the variation in $X_1$. When there is a joint contribution by the two variables, it becomes misleading to attempt to break it up and assign part of it to one and part to the other, as the foregoing interpretation based upon the form $\beta_{12.3}r_{12} + \beta_{13.2}r_{13}$ appears to do.

As noted above, the part of the variance, $\sigma_1^2$, that is due to correlation between $X_1$ and $X_2$ and $X_3$ together is described as follows:¹

\[ \sigma_{x_1'}^2 = \sigma_1^2R_{1.23}^2 \]

The right side of this expression describes the variance of the plane of regression, just as $r_{12}^2\sigma_1^2$ describes the variance of the line of regression. This part of the variance in the $X_1$ variable is made up of two parts that may be analyzed as follows:²

\[
\begin{aligned}
\sigma_{1.23}^2 &= \sigma_1^2(1 - r_{12}^2)(1 - r_{13.2}^2) \\
&= \sigma_1^2(1 - r_{12}^2 - r_{13.2}^2 + r_{12}^2r_{13.2}^2) \\
&= \sigma_1^2 - r_{12}^2\sigma_1^2 - \sigma_1^2(1 - r_{12}^2)r_{13.2}^2 \\
&= \sigma_1^2 - r_{12}^2\sigma_1^2 - \sigma_{1.2}^2r_{13.2}^2
\end{aligned}
\]

Therefore, by transposing terms,

\[ \sigma_1^2 = r_{12}^2\sigma_1^2 + r_{13.2}^2\sigma_{1.2}^2 + \sigma_{1.23}^2 \tag{33} \]

It follows from Eqs. (33) and (29) that

\[ \sigma_{x_1'}^2 = r_{12}^2\sigma_1^2 + r_{13.2}^2\sigma_{1.2}^2 \]

in which $\sigma_{1.2}^2$ is the scatter about the line of regression of $X_1$ on $X_2$.

¹ Cf. Eqs. (29) and (31).
² See footnote, p. 416.

It is thus seen that the variance $\sigma_{x_1'}^2$ of the plane of regression consists of two parts: (1) the variance due to the total linear association between $X_1$ and $X_2$, which is $r_{12}^2\sigma_1^2$; and (2) of the remaining variance about the line of regression ($\sigma_{1.2}^2$), the portion that is due to the influence of $X_3$ not already included in $r_{12}$, namely, $r_{13.2}^2\sigma_{1.2}^2$. The partial coefficient of correlation $r_{13.2}$ describes the relationship between $X_1$ and $X_3$ when $X_2$ is held constant; in other words, it is the net correlation between $X_1$ and $X_3$. The square of this partial coefficient of correlation is therefore the coefficient of proportional variance in $X_1$ due to net correlation with $X_3$. If to variance due to total association with $X_2$ is added the variance due to net correlation with $X_3$, the result is the variance due to total correlation with $X_2$ and $X_3$ jointly. The remainder of the variance, as Eq. (33) indicates, is due to other causes not linearly correlated with $X_2$ and $X_3$.
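
The split of the total variance described by Eq. (33) can be traced with numbers. The sketch below is illustrative only and assumes the zero-order correlations of the grades example (Table 53) together with $\sigma_1 = 43.94$ (Table 52); the helper name `partial_r` is hypothetical.

```python
import math

# Values taken from the grades example (Tables 52 and 53).
r12, r13, r23 = 0.89576, 0.62894, 0.59373
sigma1 = 43.94


def partial_r(r_ij, r_ik, r_jk):
    """First-order partial correlation r_ij.k computed from zero-order r's."""
    return (r_ij - r_ik * r_jk) / math.sqrt((1.0 - r_ik ** 2) * (1.0 - r_jk ** 2))


r13_2 = partial_r(r13, r12, r23)          # net correlation of X1 and X3, X2 held constant

var_total = sigma1 ** 2                   # sigma_1^2
var_line = r12 ** 2 * var_total           # part (1): total linear association with X2
var_about_line = var_total - var_line     # sigma_1.2^2, scatter about the line on X2
var_net_x3 = r13_2 ** 2 * var_about_line  # part (2): net correlation with X3
var_plane = var_line + var_net_x3         # variance of the plane of regression
var_residual = var_about_line * (1.0 - r13_2 ** 2)   # sigma_1.23^2

print(f"r_13.2 = {r13_2:.5f}")
print(f"variance of plane = {var_plane:.2f} of total {var_total:.2f}")
print(f"share explained (R^2) = {var_plane / var_total:.5f}")
print(f"residual variance = {var_residual:.2f}")
```

The three pieces `var_line`, `var_net_x3`, and `var_residual` add up to the total variance, which is the content of Eq. (33).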

Analysis of Variance and Causal Relationships. Where other 
knowledge suggests that causal relationships run only in one 
direction, the preceding analysis takes on considerable signifi- 
cance. In biological investigations, for example, where the 
effect of heredity is being studied, it seems logical to assume 
that variations in parents cause variations in offspring and that 
the causal relationship does not run the other way. Again, in 
certain economic problems, it is to be assumed that fluctuations 
in weather conditions bring about changes in economic con- 
ditions, but that the latter have no effect upon the former. In 
one-directional setups of this kind, the $\beta^2$'s take on the full significance of the connotation “coefficients of determination.” The $\beta^2$'s measure the amount of variation in the dependent variable caused by fluctuations in each of the independent variables separately, and in conjunction with the other variables in the $2r$-$\beta$ cross-product expression. It is in this type of problem that correlation analysis attains its greatest significance.

Where there is no reason to believe that causal relationships are unilateral, the interpretation of the results of correlation analysis in terms of causal determination becomes unscientific. In most problems there is interaction between the variables rather than a strictly one-directional association. It is still possible to estimate what is called the dependent variable from knowledge of so-called independent variables, but the latter must not be looked upon as determining the former. With reference to a regression equation used for purposes of estimation and prediction, the $\beta^2$'s and the $2r$-$\beta$ cross-product terms are useful in showing how important certain factors are, separately and in conjunction with other factors, in making estimates or predictions of the dependent variable. They become coefficients of determination only in the sense of statistical determination or estimation and not in any sense of physical, biological, or economic causation.

Of incidental importance to method and theory but of con- 
siderable importance to practice, Eq. (31) provides a cross check 
on the arithmetical work for finding most of the statistics in the 
multiple-correlation problem. 


EXTENSION OF MULTIPLE- AND PARTIAL-CORRELATION 
ANALYSIS TO FOUR VARIABLES 

The foregoing analysis, which has pertained to only three 
variables, may be extended to cover a greater number of vari- 
ables. In this section its extension to a four-variable problem 
will be discussed. In the next section its extension to any desired 
number of variables will be considered. 

When there are more than two independent variables in a regression equation, the number of $\beta$ coefficients to be determined is correspondingly increased. If there were four variables, for example, that is, three independent variables, the $\beta$ form of the regression equation would become

\[ \frac{x_1'}{\sigma_1} = \beta_{12.34}\frac{x_2}{\sigma_2} + \beta_{13.24}\frac{x_3}{\sigma_3} + \beta_{14.23}\frac{x_4}{\sigma_4} \tag{34} \]

If this regression plane is fitted by the method of least squares, the $\beta$'s can be determined in terms of the $r$'s in the manner described for the three-variable problem; but, there being three equations to be solved for three $\beta$'s, the solutions are not so simple as those given by Eqs. (8) and (9). Some special method of solution may be employed, such as the so-called “Doolittle method” of substitution or some determinant method.¹

The chief advantage claimed for the Doolittle method is that 
it provides a check at each step in the problem; but experience 
for several years with its use for four-variable correlation prob- 
lems has demonstrated its complexity and sensitivity. For a 
multivariate problem of more than four variables both the
Doolittle work sheet and the determinant methods become
increasingly cumbersome.

A saving in computation is obtained by using formulas based upon the algebraic evaluation of correlation statistics. This saving was demonstrated for the trivariate case in the first part of this chapter, where it was found that the $\beta$'s could be evaluated in terms of the zero-order $r$'s [see Eq. (9)]. In addition, however, it is possible by symmetry to extend the algebraic results of some of the formulas to apply to a multivariate correlation of as many variables as desired, in effect to evaluate algebraically the correlation statistics for the general regression equation

\[ X_1' = a_{1.23\ldots n} + b_{12.34\ldots n}X_2 + b_{13.24\ldots n}X_3 + \cdots + b_{1n.23\ldots(n-1)}X_n \tag{35} \]


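First the correlation statistics will be algebraically evaluated for the four-variable case, which is done as follows:

The normal least-squares equations for fitting a plane of regression of $X_1$ on $X_2$, $X_3$, and $X_4$, when divided by $N$, are as follows:²

\[
\begin{aligned}
\frac{\sum x_1x_2}{N\sigma_1\sigma_2} &= \beta_{12.34}\frac{\sum x_2^2}{N\sigma_2^2} + \beta_{13.24}\frac{\sum x_2x_3}{N\sigma_2\sigma_3} + \beta_{14.23}\frac{\sum x_2x_4}{N\sigma_2\sigma_4} \\
\frac{\sum x_1x_3}{N\sigma_1\sigma_3} &= \beta_{12.34}\frac{\sum x_2x_3}{N\sigma_2\sigma_3} + \beta_{13.24}\frac{\sum x_3^2}{N\sigma_3^2} + \beta_{14.23}\frac{\sum x_3x_4}{N\sigma_3\sigma_4} \\
\frac{\sum x_1x_4}{N\sigma_1\sigma_4} &= \beta_{12.34}\frac{\sum x_2x_4}{N\sigma_2\sigma_4} + \beta_{13.24}\frac{\sum x_3x_4}{N\sigma_3\sigma_4} + \beta_{14.23}\frac{\sum x_4^2}{N\sigma_4^2}
\end{aligned}
\]

These may be written, by substituting for their equivalent values, $r_{12}$, $\sigma_2^2$, $r_{23}$, $r_{24}$, $r_{13}$, $\sigma_3^2$, $r_{34}$, $r_{14}$, $\sigma_4^2$, as follows:

\[
\begin{aligned}
r_{12} &= \beta_{12.34} + r_{23}\beta_{13.24} + r_{24}\beta_{14.23} \\
r_{13} &= r_{23}\beta_{12.34} + \beta_{13.24} + r_{34}\beta_{14.23} \\
r_{14} &= r_{24}\beta_{12.34} + r_{34}\beta_{13.24} + \beta_{14.23}
\end{aligned}
\tag{36}
\]

If the first of Eqs. (36) is multiplied by $r_{23}$ and the result is subtracted from the second, $\beta_{12.34}$ is eliminated and it is found that

\[ r_{13} - r_{12}r_{23} = (1 - r_{23}^2)\beta_{13.24} + (r_{34} - r_{23}r_{24})\beta_{14.23} \]

or

\[ \frac{r_{13} - r_{12}r_{23}}{1 - r_{23}^2} = \beta_{13.24} + \frac{r_{34} - r_{23}r_{24}}{1 - r_{23}^2}\beta_{14.23} \]

which, by Eqs. (9), is equivalent to saying

\[ \beta_{13.2} = \beta_{13.24} + \beta_{43.2}\beta_{14.23} \tag{37} \]

Similarly, if the first of Eqs. (36), multiplied by $r_{24}$, is subtracted from the third, $\beta_{12.34}$ is eliminated and it is found that

\[ \beta_{14.2} = \beta_{14.23} + \beta_{34.2}\beta_{13.24} \tag{38} \]

¹ In theoretical work the determinant method gives neat and easily remembered solutions. In practical work, however, many statisticians prefer the method of substitution.

² See pp. 411-412.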
Correlation Statistics in Terms of Lower-order Correlation Statistics of the Same Kind. From Eqs. (37) and (38), solved simultaneously, it is found that

\[ \beta_{14.23} = \frac{\beta_{14.2} - \beta_{34.2}\beta_{13.2}}{1 - \beta_{34.2}\beta_{43.2}} \tag{39} \]

All the other second-order $\beta$'s may be evaluated in a similar algebraic manner, or written by symmetry; for example, by symmetry with Eq. (39), the $\beta_{13.24}$ is as follows:

\[ \beta_{13.24} = \frac{\beta_{13.2} - \beta_{43.2}\beta_{14.2}}{1 - \beta_{34.2}\beta_{43.2}} \tag{40} \]
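
The agreement between a direct solution of the normal equations (36) and the step-by-step formulas built from first-order $\beta$'s can be checked numerically. The following is a minimal sketch under stated assumptions: $r_{12}$, $r_{13}$, and $r_{23}$ are the values from the grades example (Table 53), while $r_{14}$, $r_{24}$, and $r_{34}$ are hypothetical values chosen only for illustration. The determinant (Cramer's rule) solver and all names are illustrative, not the text's work-sheet procedure.

```python
# Zero-order correlations. r12, r13, r23 are from Table 53;
# r14, r24, r34 are HYPOTHETICAL values used only for illustration.
r = {(1, 2): 0.89576, (1, 3): 0.62894, (2, 3): 0.59373,
     (1, 4): 0.50, (2, 4): 0.45, (3, 4): 0.40}


def corr(i, j):
    """Zero-order r_ij; the order of the two subscripts is immaterial."""
    return 1.0 if i == j else r[(min(i, j), max(i, j))]


def det3(m):
    """Determinant of a 3x3 matrix given as a list of three rows."""
    (a, b, c), (d, e, f), (g, h, k) = m
    return a * (e * k - f * h) - b * (d * k - f * g) + c * (d * h - e * g)


def solve3(m, y):
    """Solve a 3x3 linear system by determinants (Cramer's rule)."""
    d = det3(m)
    solution = []
    for col in range(3):
        replaced = [row[:] for row in m]
        for row_index in range(3):
            replaced[row_index][col] = y[row_index]
        solution.append(det3(replaced) / d)
    return solution


# Eqs. (36): unknowns are beta_12.34, beta_13.24, beta_14.23.
coefficients = [[corr(j, k) for k in (2, 3, 4)] for j in (2, 3, 4)]
constants = [corr(1, j) for j in (2, 3, 4)]
beta_12_34, beta_13_24, beta_14_23 = solve3(coefficients, constants)


def beta1(i, j, k):
    """First-order beta_ij.k from zero-order r's, following the form of Eq. (9)."""
    return (corr(i, j) - corr(i, k) * corr(j, k)) / (1.0 - corr(j, k) ** 2)


denominator = 1.0 - beta1(3, 4, 2) * beta1(4, 3, 2)
# Second-order betas built from first-order betas with X2 as the first variable held.
beta_14_23_stepwise = (beta1(1, 4, 2) - beta1(3, 4, 2) * beta1(1, 3, 2)) / denominator
beta_13_24_stepwise = (beta1(1, 3, 2) - beta1(4, 3, 2) * beta1(1, 4, 2)) / denominator

print(f"direct:   beta_14.23 = {beta_14_23:.6f}, beta_13.24 = {beta_13_24:.6f}")
print(f"stepwise: beta_14.23 = {beta_14_23_stepwise:.6f}, beta_13.24 = {beta_13_24_stepwise:.6f}")
```

The two routes should agree to rounding error; any other set of admissible correlations could be substituted for the hypothetical ones.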


Equations (39) and (40) are equivalent to the following expressions in terms of the $b$'s, because, if $\beta_{14.2} = b_{14.2}\dfrac{\sigma_4}{\sigma_1}$ and similar values of $\beta$'s in terms of corresponding $b$'s and standard-deviation ratios are substituted, the standard deviations cancel out.¹

\[ b_{14.23} = \frac{b_{14.2} - b_{34.2}b_{13.2}}{1 - b_{34.2}b_{43.2}} \tag{39'} \]

\[ b_{13.24} = \frac{b_{13.2} - b_{43.2}b_{14.2}}{1 - b_{34.2}b_{43.2}} \tag{40'} \]

¹ Since the order of digits in the subscript after the point is immaterial in both the $b$ and $\beta$ statistics, the following formulas may be used as checks


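Correlation Statistics in Terms of Lower-order $r$'s and $\sigma$'s. It is possible to express the above formulas in terms of lower-order $r$'s and $\sigma$'s if this method of calculation is preferred. It was noted in Eq. (26) that

\[ r_{12.3} = \beta_{12.3}\frac{\sigma_1\sigma_{2.3}}{\sigma_2\sigma_{1.3}} \]

and, by symmetry, it follows that

\[ r_{14.3} = \beta_{14.3}\frac{\sigma_1\sigma_{4.3}}{\sigma_4\sigma_{1.3}} \]

Also, by Eq. (27),

\[ r_{14.3} = b_{14.3}\frac{\sigma_{4.3}}{\sigma_{1.3}} \]

Consequently,

\[ \beta_{14.3} = r_{14.3}\frac{\sigma_4\sigma_{1.3}}{\sigma_1\sigma_{4.3}} \qquad b_{14.3} = r_{14.3}\frac{\sigma_{1.3}}{\sigma_{4.3}} \]

If these values of the $\beta$'s in terms of $r$'s and standard deviations are substituted in the formula for $\beta_{12.34}$ that is symmetrical with Eq. (39), and the values of the $b$'s in terms of the $r$'s and the scatter are substituted in the formula symmetrical with Eq. (39'), the following results are obtained:

\[ \beta_{12.34} = \frac{r_{12.3} - r_{14.3}r_{24.3}}{1 - r_{24.3}^2}\cdot\frac{\sigma_2\sigma_{1.3}}{\sigma_1\sigma_{2.3}} \tag{41} \]

\[ b_{12.34} = \frac{r_{12.3} - r_{14.3}r_{24.3}}{1 - r_{24.3}^2}\cdot\frac{\sigma_{1.3}}{\sigma_{2.3}} \tag{41'} \]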


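on the ones given in the text:

\[ \beta_{14.32} = \frac{\beta_{14.3} - \beta_{24.3}\beta_{12.3}}{1 - \beta_{24.3}\beta_{42.3}} \tag{39''} \]

\[ b_{14.32} = \frac{b_{14.3} - b_{24.3}b_{12.3}}{1 - b_{24.3}b_{42.3}} \tag{39'''} \]

Correlation Statistics in Terms of Correlation Statistics of Same Order. The preceding formulas may be transformed to still another form in which the correlation statistics are expressed in terms of other correlation statistics of the same order. Thus, since $\sigma_{1.34} = \sigma_{1.3}\sqrt{1 - r_{14.3}^2}$ and $\sigma_{2.34} = \sigma_{2.3}\sqrt{1 - r_{24.3}^2}$, it follows that

\[ \sigma_{1.3} = \frac{\sigma_{1.34}}{\sqrt{1 - r_{14.3}^2}} \qquad \sigma_{2.3} = \frac{\sigma_{2.34}}{\sqrt{1 - r_{24.3}^2}} \]

If these values are substituted in Eqs. (41) and (41'), the following results are obtained:

\[ \beta_{12.34} = r_{12.34}\frac{\sigma_2\sigma_{1.34}}{\sigma_1\sigma_{2.34}} \tag{42} \]

\[ b_{12.34} = r_{12.34}\frac{\sigma_{1.34}}{\sigma_{2.34}} \tag{42'} \]

Similar algebraic procedure will show that all the other $\beta$'s and $b$'s have formulas that are symmetrical with the above. For example,

\[ \beta_{13.24} = r_{13.24}\frac{\sigma_3\sigma_{1.24}}{\sigma_1\sigma_{3.24}} \qquad b_{13.24} = r_{13.24}\frac{\sigma_{1.24}}{\sigma_{3.24}} \]

\[ \beta_{14.23} = r_{14.23}\frac{\sigma_4\sigma_{1.23}}{\sigma_1\sigma_{4.23}} \qquad b_{14.23} = r_{14.23}\frac{\sigma_{1.23}}{\sigma_{4.23}} \]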


Partial Correlation in the Four-variable Case. If there are four variables $X_1$, $X_2$, $X_3$, and $X_4$, the partial-correlation coefficient between $X_1$ and $X_2$ when $X_3$ and $X_4$ are held constant, that is, $r_{12.34}$, is defined as the simple correlation coefficient between the deviations of $X_1$ from the plane of regression of $X_1$ on $X_3$ and $X_4$ and the deviations of $X_2$ from the plane of regression of $X_2$ on $X_3$ and $X_4$. Accordingly, partial correlation, when four variables are involved, is no more complex algebraically than when three variables are involved; both are simple linear correlations between residuals. In the three-variable problem the partial coefficient of correlation is found by correlating the residuals about lines of regression; in the four-variable problem the partial coefficient of correlation is found by correlating the residuals about planes of regression. The formula for the second-order partial coefficient of correlation $r_{12.34}$ is obtained in the same algebraic manner as for the first-order partial coefficients of correlation [see Eqs. (22) and (23)]. The formula is

\[ r_{12.34} = \frac{r_{12.3} - r_{14.3}r_{24.3}}{\sqrt{1 - r_{14.3}^2}\,\sqrt{1 - r_{24.3}^2}} \tag{43} \]

This is a partial coefficient of correlation of the second order. It will be noted that the formula runs in terms of the correlation coefficients of the first order. To make use of this formula in practice, therefore, it first becomes necessary to calculate the zero-order correlation coefficients and then the first-order coefficients before the second-order coefficients can be determined. The calculation of higher-order partial-correlation coefficients is thus similar to the calculation of those of lower order, but they require additional work.
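
The order of work just described (zero-order, then first-order, then second-order coefficients) can be sketched as follows. This is an illustration under stated assumptions: $r_{12}$, $r_{13}$, and $r_{23}$ come from the grades example (Table 53), whereas $r_{14}$, $r_{24}$, and $r_{34}$ are hypothetical, and the helper name `partial` is not from the text.

```python
import math


def partial(r_ij, r_ik, r_jk):
    """Partial correlation of i and j with k held constant, from next-lower-order r's."""
    return (r_ij - r_ik * r_jk) / math.sqrt((1.0 - r_ik ** 2) * (1.0 - r_jk ** 2))


# Zero-order coefficients: the first three are from Table 53,
# the last three are HYPOTHETICAL values for illustration only.
r12, r13, r23 = 0.89576, 0.62894, 0.59373
r14, r24, r34 = 0.50, 0.45, 0.40

# Step 1: first-order coefficients with X3 held constant.
r12_3 = partial(r12, r13, r23)
r14_3 = partial(r14, r13, r34)
r24_3 = partial(r24, r23, r34)

# Step 2: second-order coefficient, holding X4 constant as well.
r12_34 = partial(r12_3, r14_3, r24_3)

# Check: holding X4 first and X3 second should give the same coefficient.
r12_4 = partial(r12, r14, r24)
r13_4 = partial(r13, r14, r34)
r23_4 = partial(r23, r24, r34)
r12_43 = partial(r12_4, r13_4, r23_4)

print(f"r_12.3 = {r12_3:.5f}, r_14.3 = {r14_3:.5f}, r_24.3 = {r24_3:.5f}")
print(f"r_12.34 = {r12_34:.5f} (via X3 then X4), {r12_43:.5f} (via X4 then X3)")
```

Computing the coefficient by both routes gives a convenient arithmetic check, since the order of the digits after the point is immaterial.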

Third-order Variance and Multiple-correlation Coefficient. The formula for third-order variance in the four-variable problem is obtained by adding a term to the formula for scatter in the three-variable problem, as follows:

\[ \sigma_{1.234}^2 = \sigma_1^2\left(1 - \beta_{12.34}r_{12} - \beta_{13.24}r_{13} - \beta_{14.23}r_{14}\right) \tag{44} \]

The equation for the multiple-correlation coefficient is, by symmetry, as follows:

\[ R_{1.234}^2 = 1 - \frac{\sigma_{1.234}^2}{\sigma_1^2} \tag{45} \]

or

\[ R_{1.234}^2 = \beta_{12.34}r_{12} + \beta_{13.24}r_{13} + \beta_{14.23}r_{14} \]

EXTENSION OF MULTIPLE- AND PARTIAL-CORRELATION 
ANALYSIS TO ANY DESIRED NUMBER OF VARIABLES 
For a three-variable and even for a four-variable correlation problem, it is probably desirable first to calculate the $\beta$'s from the lower-order $\beta$'s. In the three-variable correlation problem, this means calculating the $\beta$'s from the zero-order $r$'s, which at that level correspond to the $\beta$'s.
For more than four variables it may be better first to calculate the higher-order $r$'s from the lower-order $r$'s and thereafter obtain the $b$ and $\beta$ statistics. Following is the extension to the general multivariate problem of the various formulas, showing the series of formulas in some cases from the simple correlation to the multivariate in order to depict how they are readily obtained by symmetry.

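Extension of Formulas to General Multivariate Problem. Statistics for the Regression Planes. The $b$ and $\beta$ statistics are evaluated, in general, by the following formula:

\[ \text{(a)}\qquad b_{ij.kl\ldots n} = r_{ij.kl\ldots n}\frac{\sigma_{i.kl\ldots n}}{\sigma_{j.kl\ldots n}} \]

or

\[ \beta_{ij.kl\ldots n} = r_{ij.kl\ldots n}\frac{\sigma_j\,\sigma_{i.kl\ldots n}}{\sigma_i\,\sigma_{j.kl\ldots n}} \]

This formula can be used when the $r$'s and $\sigma$'s of the same order have already been obtained, but it does not provide for the calculation of the $b$'s from lower-order statistics.

In the simple linear-regression equation, carrying out the instructions of the above formula gives $b_{12} = r_{12}\dfrac{\sigma_1}{\sigma_2}$. If there are three variables, it becomes $b_{12.3} = r_{12.3}\dfrac{\sigma_{1.3}}{\sigma_{2.3}}$, etc.

The relationship between the $b$'s and the $\beta$'s is not affected by the number of variables and may be summarized as follows:

\[
\text{(b)}\qquad
\begin{aligned}
\beta_{12} &= b_{12}\frac{\sigma_2}{\sigma_1} \\
\beta_{12.3} &= b_{12.3}\frac{\sigma_2}{\sigma_1} \\
\beta_{13.2} &= b_{13.2}\frac{\sigma_3}{\sigma_1} \\
\beta_{13.24} &= b_{13.24}\frac{\sigma_3}{\sigma_1}
\end{aligned}
\]

or

\[ \beta_{ij.kl\ldots n} = b_{ij.kl\ldots n}\frac{\sigma_j}{\sigma_i} \]

The $a$'s for the various planes of regression are found by formulas such as the following:

\[
\text{(c)}\qquad
\begin{aligned}
a_{1.234} &= \bar{X}_1 - b_{12.34}\bar{X}_2 - b_{13.24}\bar{X}_3 - b_{14.23}\bar{X}_4 \\
a_{2.134} &= \bar{X}_2 - b_{21.34}\bar{X}_1 - b_{23.14}\bar{X}_3 - b_{24.13}\bar{X}_4 \\
a_{i.jk\ldots n} &= \bar{X}_i - b_{ij.k\ldots n}\bar{X}_j - b_{ik.j\ldots n}\bar{X}_k - \cdots - b_{in.jk\ldots(n-1)}\bar{X}_n
\end{aligned}
\]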
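High-order Variances. The higher-order variances may be estimated by using either the $\beta$ formulas or the partial $r$ formulas:

\[
\text{(d)}\qquad
\begin{aligned}
\sigma_{1.234}^2 &= \sigma_1^2\left(1 - \beta_{12.34}r_{12} - \beta_{13.24}r_{13} - \beta_{14.23}r_{14}\right) \\
\sigma_{i.jk\ldots n}^2 &= \sigma_i^2\left(1 - \beta_{ij.k\ldots n}r_{ij} - \beta_{ik.j\ldots n}r_{ik} - \cdots - \beta_{in.jk\ldots(n-1)}r_{in}\right)
\end{aligned}
\]

and

\[
\begin{aligned}
\sigma_{1.2}^2 &= \sigma_1^2(1 - r_{12}^2) \\
\sigma_{1.23}^2 &= \sigma_1^2(1 - r_{12}^2)(1 - r_{13.2}^2) \\
\sigma_{1.234}^2 &= \sigma_1^2(1 - r_{12}^2)(1 - r_{13.2}^2)(1 - r_{14.23}^2) \\
\sigma_{i.jk\ldots n}^2 &= \sigma_i^2(1 - r_{ij}^2)(1 - r_{ik.j}^2)(1 - r_{il.jk}^2)\cdots(1 - r_{in.jk\ldots(n-1)}^2)
\end{aligned}
\]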

The Multiple-correlation Formulas. As in the case of the relationship between the $b$'s and the $\beta$'s, the formula for the multiple-correlation coefficient is of the same form regardless of the number of variables; it is always a comparison of the residual scatter about a plane of regression with the total standard deviation. In the four-variable and the general multiple-variable cases it is as follows:

\[
\text{(e)}\qquad
\begin{aligned}
R_{1.234}^2 &= 1 - \frac{\sigma_{1.234}^2}{\sigma_1^2} \\
R_{i.jk\ldots n}^2 &= 1 - \frac{\sigma_{i.jk\ldots n}^2}{\sigma_i^2}
\end{aligned}
\]


The Partial-correlation Formulas. The forms of the formulas for the partial coefficients of correlation are also independent of the number of variables in the problem, because the partial correlation coefficient always pertains to a simple linear correlation. The formulas are therefore symmetrical, as follows:

Second-order partials:

\[ r_{12.34} = \frac{r_{12.3} - r_{14.3}r_{24.3}}{\sqrt{1 - r_{14.3}^2}\,\sqrt{1 - r_{24.3}^2}} \]

And in general, the formula for calculating partials of a given order from the next lower order partials:

\[ r_{ij.k\ldots n} = \frac{r_{ij.k\ldots(n-1)} - r_{in.k\ldots(n-1)}\,r_{jn.k\ldots(n-1)}}{\sqrt{1 - r_{in.k\ldots(n-1)}^2}\,\sqrt{1 - r_{jn.k\ldots(n-1)}^2}} \]

CHAPTER XVII 


ANALYSIS OF A MULTIVARIATE FREQUENCY 
DISTRIBUTION ILLUSTRATED 


To illustrate the analysis of a multivariate frequency dis- 
tribution, data on grades of 81 freshmen in a woman’s college 
were selected. These are arranged in Table 47 in such a way 
as to facilitate the construction of simple linear-correlation tables 
and to facilitate detailed study of the multivariate distribution. 
The first part of this chapter will illustrate trivariate-frequency-distribution analysis. The $X_4$ variable is included in the table so that later in the chapter the analysis can be extended to four variables. Beyond four variables, the method proceeds in a symmetrical fashion.

Examination of Multivariate Distribution. The first step in the analysis of a multivariate distribution is to determine how well the assumption of linearity of relationship is approximated. The scatter of cases in the correlation table for $X_1$ and $X_2$ appears to indicate simple linear regression between $X_1$ and $X_2$.* In Tables 48 and 49, correlation tables for $X_1$ and $X_3$ and for $X_2$ and $X_3$, respectively, the scatter of cases suggests that these regressions might be only slightly if at all nonlinear. As pointed out in the preceding chapter, however, the simple correlation charts cannot be expected to reveal how much net regression exists in a trivariate correlation problem; accordingly, further study of the trivariate distribution of $X_1$, $X_2$, and $X_3$ should be undertaken. In order to do this the multivariate distribution itself, shown in Table 47, must be studied.

For example, the net regression of $X_2$ on $X_3$ may be tested in a preliminary fashion by holding constant the $X_1$ variable. Accordingly, analysis may be made of the $X_2$ and $X_3$ grades of the 16 students whose $X_1$ grades are 200. Table 50 shows the $X_2$ and $X_3$ grades of these 16 freshmen, selected from Table 47.

* See Table 34 (p. 358).

TABLE 47.—GRADES OF 81 MOUNT HOLYOKE FRESHMEN IN VERBAL SCHOLASTIC-APTITUDE TEST, IN COLLEGE BOARD ENGLISH EXAMINATION, AND IN FIRST- AND SECOND-SEMESTER ENGLISH
(A, B, C, and D grades converted to a numerical scale)


Student $X_1$ $X_2$ $X_4$
1 240 220 407 
2 200 180 456 
3 260 240 615 
4 260 260 671 
5 160 160 546 
6 240 220 615 
7 220 200 449
8 60 120 428 
9 220 240 726 
10 200 180 504 
11 220 220 601 
12 140 180 449 
13 160 120 553 
14 240 260 546 
15 260 240 671 
16 200 160 553 
17 200 160 449 
18 240 240 532 
19 240 220 574 
20 240 220 650 
21 160 140 532 
22 220 240 518 
23 200 200 477 
24 100 100 518 
25 160 140 366 
26 200 160 449 
27 180 180 671 
28 180 160 539 
29 240 240 532 
30 200 200 546 
31 200 200 518 
32 180 160 518
33 260 220 518 
34 160 120 560 
35 240 240 539 
36 220 220 587 
37 220 240 622 
38 160 120 532 
39 200 200 643 
40 220 220 453 


TABLE 47.—(Continued)


Student $X_1$ $X_2$ $X_3$ $X_4$
41 260 260 646 643 
42 180 160 511 532
43 100 60 339 442
44 200 220 424 594
45 200 200 515 594 
46 160 120 Am ^. 525 
4T 180 = 160 4187 504 
48 280 220 581 684 
49 200 200 479 560 
50 220 220 579 532 
51 220 200 600 594 
52 240 220 610 497 
53 100 60 376 435 
54 220 220 458 484 
55 240 200 600 497 
56 200 220 489 539 
57 220 220 507 525
58 220 200 513 636 
59 240 200 644 581 
60 180 140 521 401 
61 160 140 345 442 
62 240 220 596 636 
63 260 260 667 567 
64 160 120 384 546 
65 260 240 646 511 
66 220 180 500 504 
67 220 240 613 608 
68 260 260 707 615 
69 240 220 703 657 
70 200 200 391 497 
71 140 120 416 442 
72 260 240 667 657
73 200 180 477 664 
74* 300 280 709 518 
75 180 140 527 539 
76 220 180 475 629 
77 180 180 636 594 
78 300 280 543 726 
79 220 220 432 E 
80 220 200 380 a 
81 200 180 479 589

TABLE 50.—$X_2$ AND $X_3$ GRADES OF 16 STUDENTS WHOSE $X_1$ GRADE IS 200


Student $X_2$ $X_3$
2 180 473 
10 180 437 
16 160 594 
17 160 495 
23 200 443 
26 160 535 
30 200 504 
31 200 431 
39 200 619 
44 220 424 
45 200 515 
49 200 479 
56 220 489 
70 200 391 
73 180 477 
81 180 479 


Figure 119 shows the scatter of these 16 bivariates; it indicates that the regression may be linear although there appears to be little tendency for $X_2$ and $X_3$, unaffected by $X_1$, to be correlated with each other. Evidently violence is not done to the facts by assuming that any correlation present is linear in character. It must be remembered, of course, that this preliminary test of net regression between $X_2$ and $X_3$, holding $X_1$ constant, is based on a small sample of only 16 observations. Similar tests, holding $X_1$ constant at several other values, respectively, should be made, especially if nonlinearity is suspected.

Fig. 119.—Net regression of $X_2$ on $X_3$, holding $X_1$ constant. Test based on $X_1 = 200$.
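
The preliminary test can also be summarized by a simple correlation coefficient computed within the selected group. The sketch below is illustrative only: it uses the 16 pairs of Table 50 and an ordinary product-moment formula on the ungrouped values, which is an assumption about method and not the grouped work-sheet procedure of the text.

```python
import math

# (student, X2, X3) for the 16 students whose X1 grade is 200 (Table 50).
table_50 = [
    (2, 180, 473), (10, 180, 437), (16, 160, 594), (17, 160, 495),
    (23, 200, 443), (26, 160, 535), (30, 200, 504), (31, 200, 431),
    (39, 200, 619), (44, 220, 424), (45, 200, 515), (49, 200, 479),
    (56, 220, 489), (70, 200, 391), (73, 180, 477), (81, 180, 479),
]


def pearson_r(xs, ys):
    """Product-moment correlation of two equally long lists of values."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sum_xx = sum((x - mean_x) ** 2 for x in xs)
    sum_yy = sum((y - mean_y) ** 2 for y in ys)
    return sum_xy / math.sqrt(sum_xx * sum_yy)


x2 = [row[1] for row in table_50]
x3 = [row[2] for row in table_50]
r_within_group = pearson_r(x2, x3)

print(f"N = {len(table_50)}, r(X2, X3 | X1 = 200) = {r_within_group:.3f}")
```

With so few cases the coefficient is only a rough indication, which is why the text recommends repeating the test at several other values of $X_1$.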

Inasmuch as Table 34 (page 358) shows such a clear linear total regression between $X_1$ and $X_2$, it may be assumed that the net regression of $X_1$ on $X_2$, and vice versa, is linear. For further illustration of the method of examining the multivariate frequency distribution to test for linearity of regression, Table 51 presents the joint variation in $X_1$ and $X_3$ grades of those students whose $X_2$ grade is 220. This will make it possible to test the linearity of net regression between $X_1$ and $X_3$, holding $X_2$ constant.


TABLE 51.—$X_1$ AND $X_3$ GRADES OF 18 STUDENTS WHOSE $X_2$ GRADE IS 220


Student $X_1$ $X_3$
1 240 573 
6 240 567 

11 220 443 
19 240 598 
20 240 536 
33 260 509 
36 220 531 
40 220 456 
44 200 424 
48 280 581 
50 220 579 
52 240 610 
54 220 458 
56 200 489 
57 220 567 
62 240 596 
69 240 703 
79 220 432


The bivariate frequency distribution of the eighteen cases in Table 51 is plotted in Fig. 120, which shows that their scatter, with the exception of one case, follows a linear path. It may be concluded that the net regression between $X_1$ and $X_3$ approximates the linear form. Here again, especially if nonlinearity is suspected, similar tests using several different values of $X_2$, respectively, should be made. Tables 50 and 51 serve only to illustrate the method, which would ordinarily need to be applied more completely than is done here.

Fig. 120.—Net regression between $X_1$ and $X_3$, holding $X_2$ constant. Test based on $X_2 = 220$.

Statistics of the Trivariate Frequency Distribution. Study of the trivariate frequency distribution $X_1$, $X_2$, and $X_3$, shown in Table 47, appears to indicate that regressions are linear, and therefore it may be assumed that the methods of correlation outlined in Chap. XVI may appropriately be applied. In order to calculate the statistics for a trivariate frequency distribution, it is necessary to obtain first the statistics for the various monovariate distributions.

Calculation of Zero-order Statistics. Correlation tables 48 
and 49 may be used as work sheets, as illustrated in Chap. XIV, 
to calculate the zero-order statistics. Since the problem is now
one of analyzing a trivariate frequency distribution, it is well 
to set up a schedule for calculation of the respective statistics. 
Table 52 is a schedule of the means and variances for the three 
monovariate frequency distributions taken separately. In each 
case the mean was calculated by using the formula 


xe), 


TABLE 52.—MEANS AND VARIANCES

Means | Sums of squares | Variances | Standard deviations
 | $N\sigma_1^2$ = 156,355.56 | $\sigma_1^2$ = 1,930.3155 | $\sigma_1$ = 43.94
 | $N\sigma_2^2$ = 181,155.56 | $\sigma_2^2$ = 2,236.4883 | $\sigma_2$ = 47.29
 | $N\sigma_3^2$ = 719,222.22 | $\sigma_3^2$ = 8,879.2866 | $\sigma_3$ = 94.23


The sum of squares of deviations from the mean in each case is calculated by using the formula

\[ N\sigma^2 = i^2\left[\sum fd^2 - \frac{\left(\sum fd\right)^2}{N}\right] \]

The required sums are all found in the correlation tables, for example,

\[ N\sigma_1^2 = 400\left[543 - \frac{(111)^2}{81}\right] = 400(543 - 152.11111) = 156,355.56 \]

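Table 53 is a schedule for the calculation of zero-order Pearsonian coefficients of correlation, using the equation that was used in Chap. XIV. This equation is shown at the head of Table 53. The entries all come from Tables 48 and 49 and Table 34 (page 358).


TABLE 53.—CALCULATION OF SIMPLE $r$'s

\[ r = \frac{\sum fd_xd_y - \dfrac{\left(\sum fd_x\right)\left(\sum fd_y\right)}{N}}{\sqrt{\left[\sum fd_x^2 - \dfrac{\left(\sum fd_x\right)^2}{N}\right]\left[\sum fd_y^2 - \dfrac{\left(\sum fd_y\right)^2}{N}\right]}} \]

 | (1) First term of numerator | (2) Second term of numerator | (3) Denominator | (4) Numerator, (1) - (2) | (5) $r$, (4) ÷ (3)
$r_{12}$ | 455.00 | 78.11111 | 420.74964 | 376.88889 | +0.89576
$r_{13}$ | 283.00 | -68.51852 | 558.90486 | 351.51852 | +0.62894
$r_{23}$ | 322.00 | -35.18519 | 601.59915 | 357.18519 | +0.59373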

Calculation of First-order Statistics. As suggested in Chap. XVI the first-order statistics of a trivariate frequency distribution may be calculated by several methods. The most efficient method appears to be to calculate the first-order $\beta$'s and from the first-order $\beta$'s to calculate the other first-order statistics. Table 54 is a work sheet for the orderly calculation of the first-order $\beta$'s in the illustrated trivariate frequency distribution.


TABLE 54.—CALCULATION OF THE FIRST-ORDER $\beta$'s FROM THE ZERO-ORDER $r$'s¹

\[ \beta_{ij.k} = \frac{r_{ij} - r_{ik}r_{jk}}{1 - r_{jk}^2} \]

[See Chap. XVI, Eqs. (9)]

(1) Zero-order $r$ ($\beta$): subscript | (1) Regression | (2) Product term of numerator | (3) Whole numerator | (4) $1 - r^2$ | (5) First-order $\beta$: subscript | (5) Regression
12 | 0.89576 | 0.37342 | 0.52234 | 0.64748 | 12.3 | 0.80673
13 | 0.62894 | | | | |
23 | 0.59373 | | | | |
13 | 0.62894 | 0.53184 | 0.09710 | 0.64748 | 13.2 | 0.14997
12 | 0.89576 | | | | |
23 | 0.59373 | | | | |
12 | 0.89576 | 0.37342 | 0.52234 | 0.60444 | 21.3 | 0.86417
23 | 0.59373 | | | | |
13 | 0.62894 | | | | |
23 | 0.59373 | 0.56338 | 0.03035 | 0.60444 | 23.1 | 0.05021
12 | 0.89576 | | | | |
13 | 0.62894 | | | | |
13 | 0.62894 | 0.53184 | 0.09710 | 0.19762 | 31.2 | 0.49135
23 | 0.59373 | | | | |
12 | 0.89576 | | | | |
23 | 0.59373 | 0.56338 | 0.03035 | 0.19762 | 32.1 | 0.15358
13 | 0.62894 | | | | |
12 | 0.89576 | | | | |

¹ Note the internal checks in columns (2), (3), and (4), in which each of three values occurs twice; in column (2), the first and third, second and fifth, fourth and sixth figures check; in column (3), the same orders check; in column (4), the first and second, the third and fourth, and the fifth and sixth figures check. While not independent checks, they nevertheless give confidence in the accuracy of the work as it proceeds.

If preferred, the $b$'s instead of the $\beta$'s could first be calculated, by using a similar table and the general formula

\[ b_{ij.k} = \frac{b_{ij} - b_{ik}b_{kj}}{1 - b_{jk}b_{kj}} \]

The entries in column (1) of the table are obtained from Table 53. Bearing in mind the symmetry in the formula shown at the head of Table 54, the zero-order $r$'s, which are also the zero-order $\beta$'s, are copied in the order in which they occur in the formula. Consequently, the entries in column (2) are the products of the $r$'s in the second and third lines of each trio of $r$'s in column (1). The entry in column (3) is the first $r$ of each trio minus the entry in column (2). The $1 - r^2$ in column (4), which may be found by using a sine table, is for the third $r$ in each trio of $r$'s in column (1). Thus, if the trios of $r$'s are properly arranged in column (1), which can be done by following the general formula at the head of the table, the symmetry of the work sheet facilitates all necessary calculations. In using this work sheet, the
first step is to write in column (5) the subscript for the first- 
order 8 that is to be calculated; this subscript then determines 
the order of the zero-order r’s in column (1). The value of the 
first-order 8, entered in column (5), is found by dividing the entry 
in column (3) by the corresponding entry in column (4). 
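
The work-sheet procedure for the first-order β's can be sketched in Python as a check on the hand computation. This is only an illustrative sketch: the dictionary `R` and the helper names `r` and `first_order_beta` are assumptions introduced here, while the three zero-order coefficients are those found in Table 53.

```python
# Zero-order correlation coefficients taken from Table 53.
R = {(1, 2): 0.89576, (1, 3): 0.62894, (2, 3): 0.59373}


def r(i, j):
    """Zero-order r between variables i and j (subscript order is immaterial)."""
    return R[(min(i, j), max(i, j))]


def first_order_beta(i, j, k):
    """beta_ij.k = (r_ij - r_ik * r_jk) / (1 - r_jk^2)."""
    numerator = r(i, j) - r(i, k) * r(j, k)   # column (3) of the work sheet
    denominator = 1 - r(j, k) ** 2            # column (4) of the work sheet
    return numerator / denominator


subscripts = [(1, 2, 3), (1, 3, 2), (2, 1, 3), (2, 3, 1), (3, 1, 2), (3, 2, 1)]
betas = {s: first_order_beta(*s) for s in subscripts}

for (i, j, k), value in betas.items():
    print(f"beta_{i}{j}.{k} = {value:.5f}")
```

Because the sketch carries more decimal places than the work sheet, its results may differ from the tabled figures in the last decimal place.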

The coefficients of partial correlation are readily calculated
from the β's, as follows:¹

$$r_{ij.k}^2 = \beta_{ij.k}\,\beta_{ji.k}$$

$$\begin{aligned}
r_{12.3}^2 &= \beta_{12.3}\,\beta_{21.3} = 0.80673(0.86417) = 0.69715, & r_{12.3} &= 0.83496\\
r_{13.2}^2 &= \beta_{13.2}\,\beta_{31.2} = 0.14997(0.49135) = 0.07369, & r_{13.2} &= 0.27146\\
r_{23.1}^2 &= \beta_{23.1}\,\beta_{32.1} = 0.05021(0.15358) = 0.007711, & r_{23.1} &= 0.08782
\end{aligned}$$

¹ The coefficients of partial correlation could be checked by using any
one of several formulas, as follows:

$$r_{ij.k} = \frac{r_{ij} - r_{ik}r_{jk}}{\sqrt{1 - r_{ik}^2}\,\sqrt{1 - r_{jk}^2}}$$

$$r_{ij.k} = b_{ij.k}\,\frac{\sigma_{j.k}}{\sigma_{i.k}}$$

$$r_{ij.k} = \beta_{ij.k}\,\frac{\sqrt{1 - r_{jk}^2}}{\sqrt{1 - r_{ik}^2}}$$

These formulas all have the advantage that they determine the positive
or negative sign of the partial r; but the partial r always has the same sign
as its corresponding β. Cf. also p. 460.

Thus is determined the arithmetical value of the first-order
coefficients of partial correlation. Each coefficient of partial
correlation is positive if the β's from which it is derived are both
positive, negative if the β's are negative. The respective pairs of
β's involved are never of opposite sign.
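
The rule just stated, the square root of the product of the two β's with the sign of the β's attached, can be sketched as follows. The function name `partial_r` is an assumption made for this illustration, not a name used in the text.

```python
import math


def partial_r(beta_ij_k, beta_ji_k):
    """Partial r from the pair of betas with interchanged primary subscripts.

    r^2 = beta_ij.k * beta_ji.k; the sign of r is the common sign of the betas.
    """
    product = beta_ij_k * beta_ji_k
    if product < 0:
        # The two betas are never of opposite sign, so this signals bad input.
        raise ValueError("betas of opposite sign; check the entries")
    return math.copysign(math.sqrt(product), beta_ij_k)


r_12_3 = partial_r(0.80673, 0.86417)
r_13_2 = partial_r(0.14997, 0.49135)
r_23_1 = partial_r(0.05021, 0.15358)

print(round(r_12_3, 5), round(r_13_2, 5), round(r_23_1, 5))
```

The same function serves unchanged for partial coefficients of higher order, since only the pair of β's entering the product changes.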

The b statistics are calculated from the β's as follows:

$$b_{ij.k} = \beta_{ij.k}\,\frac{\sigma_i}{\sigma_j}$$

$$\begin{aligned}
b_{12.3} &= \beta_{12.3}\frac{\sigma_1}{\sigma_2} = 0.80673\,\frac{43.9354}{47.2915} = 0.80673(0.92903) = 0.74948\\
b_{13.2} &= 0.14997\,\frac{43.9354}{94.2300} = 0.14997(0.46626) = 0.06992\\
b_{21.3} &= 0.86417\,\frac{47.2915}{43.9354} = 0.86417(1.07639) = 0.93020\\
b_{23.1} &= 0.05021\,\frac{47.2915}{94.2300} = 0.05021(0.50187) = 0.02520\\
b_{31.2} &= 0.49135\,\frac{94.2300}{43.9354} = 0.49135(2.1447) = 1.05380\\
b_{32.1} &= 0.15358\,\frac{94.2300}{47.2915} = 0.15358(1.9925) = 0.30601
\end{aligned}$$
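
The conversion from β to b is a single multiplication by a ratio of standard deviations, as the following sketch shows. The function name `b_from_beta` is hypothetical; the standard deviations are those of Table 52 carried to four decimals.

```python
def b_from_beta(beta, sigma_dependent, sigma_independent):
    """b_ij.k = beta_ij.k * sigma_i / sigma_j."""
    return beta * sigma_dependent / sigma_independent


sigma = {1: 43.9354, 2: 47.2915, 3: 94.2300}

b_12_3 = b_from_beta(0.80673, sigma[1], sigma[2])
b_13_2 = b_from_beta(0.14997, sigma[1], sigma[3])
b_31_2 = b_from_beta(0.49135, sigma[3], sigma[1])

print(round(b_12_3, 5), round(b_13_2, 5), round(b_31_2, 5))
```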


The first-order a statistics are calculated as follows:

$$a_{i.jk} = \bar{X}_i - b_{ij.k}\bar{X}_j - b_{ik.j}\bar{X}_k$$

$$\begin{aligned}
a_{1.23} &= 217.4074 - 0.74948(204.074) - 0.06992(515.4816) = 28.4156\\
a_{2.13} &= 204.074 - 0.93020(217.4074) - 0.02520(515.4816) = -11.148\\
a_{3.12} &= 515.4816 - 1.05380(217.4074) - 0.30601(204.074) = 223.929
\end{aligned}$$

The equations of the three planes of regression are, therefore,
as follows:

$$\begin{aligned}
X_1' &= 28.42 + 0.75X_2 + 0.07X_3\\
X_2' &= -11.15 + 0.93X_1 + 0.025X_3\\
X_3' &= 223.93 + 1.05X_1 + 0.31X_2
\end{aligned}$$

If $X_1$ is considered the dependent variable, it can be estimated
from the first equation; if $X_2$ is considered the dependent variable,
it can be estimated from the second equation; if $X_3$ is
considered the dependent variable, it can be estimated from the
third equation. The second-order standard deviations, respectively,
about the three planes of regression may also be calculated.¹

$$\sigma_{i.jk}^2 = \sigma_i^2(1 - r_{ij}^2)(1 - r_{ik.j}^2)$$

$$\begin{aligned}
\sigma_{1.23}^2 &= \sigma_1^2(1 - r_{12}^2)(1 - r_{13.2}^2) = 1,930.3155(0.19762)(0.92631) = 353.3585, & \sigma_{1.23} &= 18.7975\\
\sigma_{2.13}^2 &= \sigma_2^2(1 - r_{12}^2)(1 - r_{23.1}^2) = 2,236.4883(0.19762)(0.99229) = 438.5672, & \sigma_{2.13} &= 20.9416\\
\sigma_{3.12}^2 &= \sigma_3^2(1 - r_{13}^2)(1 - r_{23.1}^2) = 8,879.2866(0.60444)(0.99229) = 5,325.616, & \sigma_{3.12} &= 72.9767
\end{aligned}$$

The multiple-correlation coefficients, which also measure the
goodness of fit of the planes of regression, may now be calculated
as follows:²

$$R_{i.jk}^2 = 1 - \frac{\sigma_{i.jk}^2}{\sigma_i^2}$$

$$\begin{aligned}
R_{1.23}^2 &= 1 - \frac{353.358}{1,930.3155} = 1 - 0.1830 = 0.8170, & R_{1.23} &= 0.9039\\
R_{2.13}^2 &= 1 - \frac{438.5672}{2,236.4883} = 1 - 0.1961 = 0.8039, & R_{2.13} &= 0.8966\\
R_{3.12}^2 &= 1 - \frac{5,325.616}{8,879.2866} = 1 - 0.5998 = 0.4002, & R_{3.12} &= 0.6326
\end{aligned}$$

¹ They could also be calculated by using Eq. (19) (p. 415). Thus the
calculation of $\sigma_{1.23}^2$ could be checked not only by using

$$\sigma_{1.23}^2 = \sigma_1^2(1 - r_{12}^2)(1 - r_{13.2}^2)$$

but also by using the following formula:

$$\sigma_{1.23}^2 = \sigma_1^2(1 - \beta_{12.3}r_{12} - \beta_{13.2}r_{13})$$

² The calculation of R may be checked by using Eq. (20), p. 417.
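
The scatter about the first plane and the corresponding multiple-correlation coefficient may be sketched in Python, using both the product formula and the check formula of the footnote. The function names are assumptions for this illustration. Because the sketch does not round the intermediate factors, its results agree with the printed figures only to about two decimal places in the variance.

```python
import math


def scatter_variance(var_i, r_ij, r_ik_j_squared):
    """sigma^2_i.jk = sigma_i^2 * (1 - r_ij^2) * (1 - r^2_ik.j)."""
    return var_i * (1 - r_ij ** 2) * (1 - r_ik_j_squared)


def scatter_variance_check(var_i, beta_ij_k, r_ij, beta_ik_j, r_ik):
    """Check formula: sigma_i^2 * (1 - beta_ij.k * r_ij - beta_ik.j * r_ik)."""
    return var_i * (1 - beta_ij_k * r_ij - beta_ik_j * r_ik)


def multiple_r(scatter_var, var_i):
    """R_i.jk from R^2 = 1 - sigma^2_i.jk / sigma_i^2."""
    return math.sqrt(1 - scatter_var / var_i)


var_1 = 1930.3155
s2_product = scatter_variance(var_1, r_ij=0.89576, r_ik_j_squared=0.07369)
s2_check = scatter_variance_check(var_1, 0.80673, 0.89576, 0.14997, 0.62894)
big_r = multiple_r(s2_product, var_1)

print(round(s2_product, 2), round(s2_check, 2), round(big_r, 4))
```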


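An all-round check on the various calculations may be obtained
by using Eq. (31), Chap. XVI, as follows:

TABLE 55

$$1 = \beta_{12.3}^2 + \beta_{13.2}^2 + 2r_{23}\beta_{12.3}\beta_{13.2} + \frac{\sigma_{1.23}^2}{\sigma_1^2}$$
$$1 = (0.80673)^2 + (0.14997)^2 + 2(0.59373)(0.80673)(0.14997) + \frac{\sigma_{1.23}^2}{\sigma_1^2}$$
$$1.0000 = 0.65081 + 0.02249 + 0.14367 + 0.18304$$

$$1 = \beta_{23.1}^2 + \beta_{21.3}^2 + 2r_{13}\beta_{23.1}\beta_{21.3} + \frac{\sigma_{2.13}^2}{\sigma_2^2}$$
$$1 = (0.05021)^2 + (0.86417)^2 + 2(0.62894)(0.05021)(0.86417) + \frac{\sigma_{2.13}^2}{\sigma_2^2}$$
$$1.0000 = 0.00252 + 0.74679 + 0.05458 + 0.19608$$

$$1 = \beta_{32.1}^2 + \beta_{31.2}^2 + 2r_{12}\beta_{32.1}\beta_{31.2} + \frac{\sigma_{3.12}^2}{\sigma_3^2}$$
$$1 = (0.15358)^2 + (0.49135)^2 + 2(0.89576)(0.15358)(0.49135) + \frac{\sigma_{3.12}^2}{\sigma_3^2}$$
$$1.0000 = 0.02359 + 0.24142 + 0.13519 + 0.59978$$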
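
The first of these checks can be reproduced with the short sketch below. The function name `variance_shares` is hypothetical, and the residual share 0.18304 is taken as printed rather than recomputed.

```python
def variance_shares(beta_ij, beta_ik, r_jk, residual_share):
    """Split the unit variance of X_i into the four terms of the check.

    Returns (direct share of X_j, direct share of X_k,
             joint share through r_jk, residual share).
    """
    direct_j = beta_ij ** 2
    direct_k = beta_ik ** 2
    joint = 2 * r_jk * beta_ij * beta_ik
    return direct_j, direct_k, joint, residual_share


shares = variance_shares(beta_ij=0.80673, beta_ik=0.14997,
                         r_jk=0.59373, residual_share=0.18304)
total = sum(shares)

print([round(s, 5) for s in shares], round(total, 4))
```

A total that differs from unity by more than a few units in the fourth decimal place would point to an arithmetical slip somewhere in the work sheets.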


Interpretation of Results Illustrated. The interpretation of
the above statistics of a trivariate frequency distribution may be
illustrated by assuming that it is desired to predict $X_1$, the
second-semester grades of freshmen at the woman's college
selected. From the equation for the plane of regression of $X_1$
on $X_2$ and $X_3$, namely, $X_1' = 28.42 + 0.75X_2 + 0.07X_3$, estimates
may be made of a freshman's grade in second-semester
English if her grades in the verbal scholastic-aptitude test and
in first-semester English are known.

Estimates Based on Regression Equation. If a freshman's
grade in first-semester English were 300 and her grade in the
verbal scholastic-aptitude test were 600, her second-term English
grade would be estimated at

$$\begin{aligned}
X_1' &= 28.42 + 0.75(300) + 0.07(600)\\
&= 28.42 + 225 + 42\\
&= 295.42 \approx 295
\end{aligned}$$

Since the second-term English grade will, of course, be affected
by other factors, the student's actual grade in second-semester
English will deviate from estimates based upon the regression
equation. This raises the question as to how much, on the
average, it can be expected that estimates based on the regression
equation will deviate from the actual values. The answer is
found by the determination of the value of $\sigma_{1.23}$, which has been
found above to be 18.8, or approximately 19. The standard
deviation of the differences between the actual grades and estimates
based on the above regression equation is therefore about
19. If this regression equation and first-order standard deviation
are typical of these college grades and if the differences
between actual and estimated values are in general normally
distributed, the chances are about 95 in 100 that the actual value in any
particular case will fall within limits of ±38 (= $2\sigma_{1.23}$) from the
estimated value.
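
The estimate and its limits can be sketched in Python as follows. The function name `estimate_x1` is an assumption; the coefficients are those of the first plane of regression, and the limits are taken at twice the scatter of 18.8, on the assumption stated above that the errors of estimate are normally distributed.

```python
def estimate_x1(x2, x3):
    """Second-semester English grade estimated from the first plane of regression."""
    return 28.42 + 0.75 * x2 + 0.07 * x3


SCATTER = 18.8  # sigma_1.23, the standard deviation about the plane

estimate = estimate_x1(x2=300, x3=600)
lower = estimate - 2 * SCATTER
upper = estimate + 2 * SCATTER

print(round(estimate), round(lower), round(upper))
```

For the freshman of the example the limits run from roughly 258 to 333, a range wide enough to show that the estimate, although useful, is far from exact.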

The foregoing conclusion, which is based on the value of $\sigma_{1.23}$,
can be summarized very succinctly by the calculation of $R_{1.23}$,
which has been found to be equal to 0.9039. This is a fairly
high coefficient of multiple correlation. It shows that the above
plane of regression is a good fit, and therefore estimates based
upon it can be expected to be fairly good.

Partial-correlation Coefficients. Since both b statistics in the
equation of regression are positive, it is known that the net
correlations between $X_1$ and $X_2$ and between $X_1$ and $X_3$ are positive.
The amount of the net correlation is given by the coefficients
of partial correlation $r_{12.3} = 0.83496$ and $r_{13.2} = 0.27146$.
These show that second-semester English grades are much more
closely related to first-semester English grades than they are to
verbal scholastic-aptitude test grades.

Analysis of Variance in $X_1$. From the β²'s and the β cross
products, analysis of the variance in second-semester English
grades can be made. Thus, from the first set of β²'s and cross
products in Table 55, it is seen that 65.1 per cent of the variance
in second-semester English grades, $X_1$, is accounted for by direct
association with first-semester English grades. Only 2.2 per
cent is accounted for by direct association with verbal
scholastic-aptitude test grades, although 14.4 per cent of the variance
in second-semester English is accounted for by indirect association
with both first-semester grades and verbal scholastic-aptitude
test grades. The variation in other influences accounts
for 18.3 per cent of the variance in second-semester English
grades.

Under conditions existing at the woman's college studied, it
appears to be an inevitable conclusion that knowledge of grades
in verbal scholastic-aptitude tests is not so helpful as might be
supposed in predicting the subsequent performance of college
freshman students.

Extension of Analysis to Include Four Variables. Additional
Zero-order Statistics. The extension of the trivariate frequency
distribution to include a fourth variable $X_4$ requires first the
calculation of the mean and standard deviation of the added
variable. It requires also the calculation of the simple correlation
coefficients between the new variable and each of the
other three. For illustration, the fourth variable taken is the
grade in the College Board English examination. Tables 56 to
58 are the usual work sheets for a correlation problem. From
them the necessary data are obtained for calculating the additional
zero-order statistics, as follows:


$$\begin{aligned}
\bar{X}_4 &= 549.8889 & r_{14} &= 0.49106\\
N\sigma_4^2 &= 501,201.9828 & r_{24} &= 0.48807\\
\sigma_4^2 &= 6,187.6788 & r_{34} &= 0.31551\\
\sigma_4 &= 78.6618
\end{aligned}$$


Additional First-order Statistics. Among four variables it is
possible to distinguish four different sets of trivariate frequency
distributions, each of which will have three planes of regression.
Accordingly, when four variables are involved the total number
of first-order β statistics is 24, two for each plane of regression. Six
of these 24 were calculated in Table 54; the remaining
18 may be obtained by a similar procedure. Table 59 shows the
24 β's for the illustrated four-variable problem, grouped according
to the four possible trivariate frequency distributions.

Each of the four trivariate frequency distributions could be
analyzed as illustrated in the preceding sections of this chapter.
From the first-order β's shown in Table 59 all the other
first-order statistics may be obtained, by methods already explained.

In few problems is it necessary or even desirable to calculate
all 24 first-order β statistics of the four trivariate frequency
distributions involved in a four-variable set. As may be seen from
an examination of Table 60, it is possible to calculate all the
second-order β statistics if only 18 of the 24 first-order β statistics
are known. If only one of the four planes of regression in the
four-variable correlation problem is significant or important, it is
necessary to calculate only 8 of the first-order β statistics.

TABLE 59. THE FIRST-ORDER β's IN THE FOUR TRIVARIATE FREQUENCY
DISTRIBUTIONS FOR FOUR VARIABLES
Data on four kinds of grades of 81 college freshmen, at the Selected Woman's
College

First plane | Second plane | Third plane

Trivariate Distribution $X_1$, $X_2$, $X_3$
$\beta_{12.3} = 0.80673$ | $\beta_{21.3} = 0.86417$ | $\beta_{31.2} = 0.49135$
$\beta_{13.2} = 0.14997$ | $\beta_{23.1} = 0.05021$ | $\beta_{32.1} = 0.15358$

Trivariate Distribution $X_1$, $X_2$, $X_4$
$\beta_{12.4} = 0.86124$ | $\beta_{21.4} = 0.86457$ | $\beta_{41.2} = 0.27260$
$\beta_{14.2} = 0.07071$ | $\beta_{24.1} = 0.06352$ | $\beta_{42.1} = 0.24392$

Trivariate Distribution $X_1$, $X_3$, $X_4$
$\beta_{13.4} = 0.52642$ | $\beta_{31.4} = 0.62463$ | $\beta_{41.3} = 0.48413$
$\beta_{14.3} = 0.32498$ | $\beta_{34.1} = 0.00877$ | $\beta_{43.1} = 0.01101$

Trivariate Distribution $X_2$, $X_3$, $X_4$
$\beta_{23.4} = 0.48837$ | $\beta_{32.4} = \ldots$ | $\beta_{42.3} = 0.46447$
$\beta_{24.3} = 0.33400$ | $\beta_{34.2} = 0.03378$ | $\beta_{43.2} = 0.03974$
Second-order Statistics in a Four-variable Problem. In the 
four-variable correlation problem, statistics for four planes of 
regression may be obtained. Following are the four possible 
regression equations: 


$$\begin{aligned}
X_1' &= a_{1.234} + b_{12.34}X_2 + b_{13.24}X_3 + b_{14.23}X_4\\
X_2' &= a_{2.134} + b_{21.34}X_1 + b_{23.14}X_3 + b_{24.13}X_4\\
X_3' &= a_{3.124} + b_{31.24}X_1 + b_{32.14}X_2 + b_{34.12}X_4\\
X_4' &= a_{4.123} + b_{41.23}X_1 + b_{42.13}X_2 + b_{43.12}X_3
\end{aligned}$$

Also, for each plane of regression a scatter and a coefficient of
multiple correlation may be calculated. The procedure is
similar to that already illustrated; that is to say, the second-order
β's are first obtained, and from them all the other second-order
statistics are calculated. Table 60 illustrates the procedure for
making the necessary calculations to obtain the 12 possible
second-order β statistics.

Calculation of Second-order Statistics. In a problem where the 
first-order partial coefficients of correlation are already calculated, 
it is advisable to modify the formula for finding second-order 8 
statistics from first-order 8 statistics as follows: 

According to Eq. (39), Chap. XVI, it was found that 


$$\beta_{ij.kn} = \frac{\beta_{ij.k} - \beta_{in.k}\,\beta_{nj.k}}{1 - \beta_{nj.k}\,\beta_{jn.k}}$$

But from Eq. (24), Chap. XVI, it is known that

$$r_{jn.k}^2 = \beta_{nj.k}\,\beta_{jn.k}$$

Accordingly, the formula for finding the second-order β statistics
can be modified as follows:

$$\beta_{ij.kn} = \frac{\beta_{ij.k} - \beta_{in.k}\,\beta_{nj.k}}{1 - r_{jn.k}^2}$$

In order to secure the greatest convenience in calculation, the
arrangement of the items in the work sheet (Table 60) is according
to the terms of this formula. First the desired subscript for
the β statistic to be calculated is entered in column (5); then,
following the formula, the order in which the required trio of
first-order β's appear in column (1) is determined. If this order
is followed, the entry in column (2) is the product of the second
two β's of the trio in column (1); the entry in column (3) is
found by subtracting the entry in column (2) from the first β of
the trio in column (1); the subscript of the third β of the trio in
column (1) is the subscript of the partial r for which $1 - r^2$ is to
be found in appropriate tables or, if preferred, calculated, and
entered in column (4). The
desired second-order β's are then calculated, by dividing the
entry in column (3) by the entry in column (4), and entered in
column (5).

TABLE 60. CALCULATION OF THE SECOND-ORDER β's FROM THE
FIRST-ORDER β's

$$\beta_{ij.kn} = \frac{\beta_{ij.k} - \beta_{in.k}\,\beta_{nj.k}}{1 - r_{jn.k}^2}$$

in which

$$r_{jn.k}^2 = \beta_{nj.k}\,\beta_{jn.k}$$

[See Chap. XVI, Eqs. (24) and (39)]

(1) First-order β      | (2) Product term | (3) Whole  | (4)                | (5) Second-order β
Subscript | Regression | of numerator     | numerator  | $1 - r_{jn.k}^2$   | Subscript | Regression
12.3 | 0.80673 | 0.15094 | 0.65579  | 0.84487 | 12.34 | 0.77620
14.3 | 0.32498 |         |          |         |       |
42.3 | 0.46447 |         |          |         |       |
13.2 | 0.14997 | 0.00281 | 0.14716  | 0.99866 | 13.24 | 0.14736
14.2 | 0.07071 |         |          |         |       |
43.2 | 0.03974 |         |          |         |       |
14.2 | 0.07071 | 0.00507 | 0.06564  | 0.99866 | 14.23 | 0.06573
13.2 | 0.14997 |         |          |         |       |
34.2 | 0.03378 |         |          |         |       |
21.3 | 0.86417 | 0.16170 | 0.70247  | 0.84267 | 21.34 | 0.83362
24.3 | 0.33400 |         |          |         |       |
41.3 | 0.48413 |         |          |         |       |
23.1 | 0.05021 | 0.00070 | 0.04951  | 0.99990 | 23.14 | 0.04951
24.1 | 0.06352 |         |          |         |       |
43.1 | 0.01101 |         |          |         |       |
24.1 | 0.06352 | 0.00044 | 0.06308  | 0.99990 | 24.13 | 0.06309
23.1 | 0.05021 |         |          |         |       |
34.1 | 0.00877 |         |          |         |       |
31.2 | 0.49135 | 0.00921 | 0.48214  | 0.98072 | 31.24 | 0.49162
34.2 | 0.03378 |         |          |         |       |
41.2 | 0.27260 |         |          |         |       |
32.1 | 0.15358 | 0.00214 | 0.15144  | 0.98451 | 32.14 | 0.15382
34.1 | 0.00877 |         |          |         |       |
42.1 | 0.24392 |         |          |         |       |
34.2 | 0.03378 | 0.03474 | −0.00096 | 0.98072 | 34.12 | −0.00098
31.2 | 0.49135 |         |          |         |       |
14.2 | 0.07071 |         |          |         |       |
41.2 | 0.27260 | 0.01953 | 0.25307  | 0.92631 | 41.23 | 0.27320
43.2 | 0.03974 |         |          |         |       |
31.2 | 0.49135 |         |          |         |       |
42.1 | 0.24392 | 0.00169 | 0.24223  | 0.99229 | 42.13 | 0.24411
43.1 | 0.01101 |         |          |         |       |
32.1 | 0.15358 |         |          |         |       |
43.1 | 0.01101 | 0.01225 | −0.00124 | 0.99229 | 43.12 | −0.00125
42.1 | 0.24392 |         |          |         |       |
23.1 | 0.05021 |         |          |         |       |

In problems for which it is not desired to calculate the
first-order coefficients of partial correlation, the alternative method
illustrated in Table 61 may be used. It is to be noted that the
only differences are that an additional β must be entered in
column (1) in each of the sets and that an additional column,
column (4), is required in which to enter the product term of the
denominator. The item in column (5) is then obtained by
taking the complement of the corresponding entry in column (4).
The second-order β is found by dividing the entry in column (3)
by the entry in column (5). For convenience of arrangement,
the product term of the numerator is written in the order
$\beta_{in.k}\beta_{nj.k}$ rather than $\beta_{nj.k}\beta_{in.k}$, and the product term of the
denominator is arranged in the order $\beta_{nj.k}\beta_{jn.k}$ rather than
$\beta_{jn.k}\beta_{nj.k}$. Except for the convenience in arrangement of the
work sheet, the order in which such product terms occur is
immaterial; but, when arranged as indicated, once the subscript
of the desired second-order β is entered in column (6), the order
in which the first-order β's occur in the equation may be followed
in entering them in column (1). There are only four first-order
β's in each set, for the third (in the numerator) is repeated in the
first part of the product term of the denominator. When this
procedure as to arrangement in the work sheet is followed, the
entry in column (2) is always the product of the two middle β's
in the set of four in column (1), and the entry in column (4) is
always the product of the last two β's entered in column (1).

TABLE 61. CALCULATION OF THE SECOND-ORDER β's FROM THE
FIRST-ORDER β's
(Alternative method illustrated)

$$\beta_{ij.kn} = \frac{\beta_{ij.k} - \beta_{in.k}\,\beta_{nj.k}}{1 - \beta_{nj.k}\,\beta_{jn.k}}$$

(1) First-order β      | (2) Product term | (3) Whole | (4) Product term | (5) Whole   | (6) Second-order β
Subscript | Regression | of numerator     | numerator | of denominator   | denominator | Subscript | Regression
12.3 | 0.80673 | 0.15094 | 0.65579 | 0.155133 | 0.84487 | 12.34 | 0.77620
14.3 | 0.32498 |         |         |          |         |       |
42.3 | 0.46447 |         |         |          |         |       |
24.3 | 0.33400 |         |         |          |         |       |
13.2 | 0.14997 | 0.00281 | 0.14716 | 0.001342 | 0.99866 | 13.24 | 0.14736
14.2 | 0.07071 |         |         |          |         |       |
43.2 | 0.03974 |         |         |          |         |       |
34.2 | 0.03378 |         |         |          |         |       |
14.2 | 0.07071 | 0.00507 | 0.06564 | 0.001342 | 0.99866 | 14.23 | 0.06573
13.2 | 0.14997 |         |         |          |         |       |
34.2 | 0.03378 |         |         |          |         |       |
43.2 | 0.03974 |         |         |          |         |       |

If this method is used, the b's instead of the β's could be first calculated, using a similar
table and the general formula

$$b_{ij.kn} = \frac{b_{ij.k} - b_{in.k}\,b_{nj.k}}{1 - b_{nj.k}\,b_{jn.k}}$$
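
The alternative work sheet can be sketched in Python as a function of the four first-order β's of each set, entered in the same order as in column (1) of Table 61. The function name `second_order_beta` and its argument names are assumptions made for this illustration.

```python
def second_order_beta(b_ij_k, b_in_k, b_nj_k, b_jn_k):
    """beta_ij.kn = (beta_ij.k - beta_in.k * beta_nj.k) / (1 - beta_nj.k * beta_jn.k).

    Arguments follow the order of the four entries in column (1) of the work sheet.
    """
    numerator = b_ij_k - b_in_k * b_nj_k      # column (3)
    denominator = 1 - b_nj_k * b_jn_k         # column (5)
    return numerator / denominator


# The three sets shown in the work sheet for the first plane of regression.
beta_12_34 = second_order_beta(0.80673, 0.32498, 0.46447, 0.33400)
beta_13_24 = second_order_beta(0.14997, 0.07071, 0.03974, 0.03378)
beta_14_23 = second_order_beta(0.07071, 0.14997, 0.03378, 0.03974)

print(round(beta_12_34, 5), round(beta_13_24, 5), round(beta_14_23, 5))
```

The denominator equals one minus the squared first-order partial r, so the same function reproduces the first method whenever that quantity has not been tabulated in advance.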

The second-order coefficients of partial correlation are calculated
from the second-order β's as follows:¹

$$r_{ij.k \ldots n}^2 = \beta_{ij.k \ldots n}\,\beta_{ji.k \ldots n}$$

or, for the four-variable case,

$$r_{ij.kn}^2 = \beta_{ij.kn}\,\beta_{ji.kn}$$

$$\begin{aligned}
r_{12.34}^2 &= 0.77620(0.83362) = 0.647056, & r_{12.34} &= 0.80440\\
r_{13.24}^2 &= 0.14736(0.49162) = 0.072445, & r_{13.24} &= 0.26916\\
r_{14.23}^2 &= 0.06573(0.27320) = 0.017957, & r_{14.23} &= 0.13400\\
r_{24.13}^2 &= 0.06309(0.24411) = 0.015401, & r_{24.13} &= 0.12410\\
r_{23.14}^2 &= 0.04951(0.15382) = 0.007616, & r_{23.14} &= 0.08728\\
r_{34.12}^2 &= -0.00098(-0.00125) = 0.000001225, & r_{34.12} &= -0.00111
\end{aligned}$$

(The negative sign of the partial r is determined by the negative
sign of the corresponding β statistic.)

The b statistics of the second order are calculated from the
second-order β's in the same way as the first-order b's from the
first-order β's, by the formula

$$b_{ij.k \ldots n} = \beta_{ij.k \ldots n}\,\frac{\sigma_i}{\sigma_j}$$

or, for the four-variable problem,

$$b_{ij.kn} = \beta_{ij.kn}\,\frac{\sigma_i}{\sigma_j}$$

$$\begin{aligned}
b_{12.34} &= \beta_{12.34}\frac{\sigma_1}{\sigma_2} = 0.77620(0.92903) = 0.72111\\
b_{13.24} &= 0.14736(0.46626) = 0.06871\\
b_{14.23} &= 0.06573\,\frac{43.93543}{78.6618} = 0.06573(0.55854) = 0.03671\\
b_{21.34} &= 0.83362(1.07639) = 0.89730\\
b_{23.14} &= 0.04951(0.50187) = 0.02485\\
b_{24.13} &= 0.06309\,\frac{47.2915}{78.6618} = 0.06309(0.60120) = 0.03793\\
b_{31.24} &= 0.49162(2.1447) = 1.05438\\
b_{32.14} &= 0.15382(1.9925) = 0.30650\\
b_{34.12} &= -0.00098\,\frac{94.2300}{78.6618} = -0.00098(1.19791) = -0.00117\\
b_{41.23} &= 0.27320\,\frac{78.6618}{43.9354} = 0.27320(1.79040) = 0.48914\\
b_{42.13} &= 0.24411\,\frac{78.66183}{47.2915} = 0.24411(1.66334) = 0.40604\\
b_{43.12} &= -0.00125\,\frac{78.6618}{94.2300} = -0.00125(0.83478) = -0.00104
\end{aligned}$$

It will be noted that, with the exception of those involving $\sigma_4$,
the standard-deviation ratios used in the above calculations
have all been computed and may be copied from the preceding
section, where the first-order b's were calculated from the
first-order β's.

The second-order a statistics are calculated as follows:

$$a_{i.jkn} = \bar{X}_i - b_{ij.kn}\bar{X}_j - b_{ik.jn}\bar{X}_k - b_{in.jk}\bar{X}_n$$

$$\begin{aligned}
a_{1.234} &= 217.4074 - 0.72111(204.074) - 0.06871(515.4816) - 0.03671(549.8889) = 14.64244\\
a_{2.134} &= 204.074 - 0.89730(217.4074) - 0.02485(515.4816) - 0.03793(549.8889) = -24.67266\\
a_{3.124} &= 515.4816 - 1.05438(217.4074) - 0.30650(204.074) + 0.00117(549.8889) = 224.34628\\
a_{4.123} &= 549.8889 - 0.40604(204.074) + 0.00104(515.4816) - 0.48914(217.4074) = 361.22013
\end{aligned}$$


The equations for the four planes of regression may now be 
written as follows: 


$$\begin{aligned}
X_1' &= 14.64 + 0.721X_2 + 0.069X_3 + 0.037X_4\\
X_2' &= -24.67 + 0.897X_1 + 0.025X_3 + 0.038X_4\\
X_3' &= 224.35 + 1.05X_1 + 0.306X_2 - 0.0012X_4\\
X_4' &= 361.22 + 0.489X_1 + 0.406X_2 - 0.001X_3
\end{aligned}$$

If $X_1$ is considered the dependent variable, it can be estimated
from the first equation; if $X_2$ is considered the dependent variable,
it can be estimated from the second equation; if $X_3$ is considered
the dependent variable, it can be estimated from the third
equation; if $X_4$ is considered the dependent variable, it can be
estimated from the fourth equation. The standard errors of
estimate, that is, the scatters, respectively, about the four
planes of regression may also be calculated.¹


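$$\sigma_{i.jkn}^2 = \sigma_{i.jk}^2(1 - r_{in.jk}^2)$$

$$\begin{aligned}
\sigma_{1.234}^2 &= \sigma_{1.23}^2(1 - r_{14.23}^2) = 353.34(0.98204) = 346.9940, & \sigma_{1.234} &= 18.628\\
\sigma_{2.134}^2 &= \sigma_{2.13}^2(1 - r_{24.13}^2) = 438.53(0.98460) = 431.7766, & \sigma_{2.134} &= 20.779\\
\sigma_{3.124}^2 &= \sigma_{3.12}^2(1 - r_{34.12}^2) = 5,325.56(1.00000) = 5,325.56, & \sigma_{3.124} &= 72.976\\
\sigma_{4.123}^2 &= \sigma_{4.12}^2(1 - r_{43.12}^2) = 4,622.8210(1.00000) = 4,622.8210, & \sigma_{4.123} &= 67.991
\end{aligned}$$

¹ For alternative methods, see p. 415 and Eq. (19), Chap. XVI.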


The multiple-correlation coefficients, which measure the goodness
of fit of the planes of regression, are calculated in the same
way as for the trivariate problem, namely,¹

$$R_{i.jkn}^2 = 1 - \frac{\sigma_{i.jkn}^2}{\sigma_i^2}$$

$$\begin{aligned}
R_{1.234}^2 &= 1 - \frac{346.994}{1,930.3155} = 1 - 0.17976 = 0.8202, & R_{1.234} &= 0.9056\\
R_{2.134}^2 &= 1 - \frac{431.7766}{2,236.4883} = 1 - 0.19306 = 0.8069, & R_{2.134} &= 0.8983\\
R_{3.124}^2 &= 1 - \frac{5,325.56}{8,879.2866} = 1 - 0.5998 = 0.4002, & R_{3.124} &= 0.6326\\
R_{4.123}^2 &= 1 - \frac{4,622.8210}{6,187.6788} = 1 - 0.74710 = 0.2529, & R_{4.123} &= 0.5029
\end{aligned}$$

For the four-variable problem, the equation for the β squares
and β cross products is as follows:

$$\beta_{ij.kn}^2 + \beta_{ik.jn}^2 + \beta_{in.jk}^2 + 2r_{jk}\beta_{ij.kn}\beta_{ik.jn} + 2r_{jn}\beta_{ij.kn}\beta_{in.jk} + 2r_{kn}\beta_{ik.jn}\beta_{in.jk} + \frac{\sigma_{i.jkn}^2}{\sigma_i^2} = 1$$

In Table 62 some of these checks are illustrated.

Interpretation of Results Illustrated. The interpretation of
the above statistics of a four-variable frequency distribution
may be illustrated by assuming that it is desired to predict the
second-semester English grades of freshmen at the woman's
college selected; in other words, $X_1$ is assumed to be the
dependent variable. From the equation for the plane of regression
of $X_1$ on $X_2$, $X_3$, and $X_4$, namely,

$$X_1' = 14.64 + 0.721X_2 + 0.069X_3 + 0.037X_4$$

estimates may be made of a freshman's grade in second-semester
English if her grades in the verbal scholastic-aptitude test, in
College Board English, and in the first-semester freshman
English course are known.

¹ For an alternative method, see Eq. (20), Chap. XVI.

Estimates Based on Regression Equation. If a freshman's
grade in first-semester English is 300, in the verbal scholastic-aptitude
test 600, and in College Board English 500, her
second-semester English grade is estimated as follows:

$$\begin{aligned}
X_1' &= 14.64 + 0.721(300) + 0.069(600) + 0.037(500)\\
&= 14.64 + 216.3 + 41.4 + 18.5\\
&= 291
\end{aligned}$$


Since the second-semester English grade will, of course, be
affected by other factors, the student's actual grade in
second-semester English will deviate from estimates based upon the
regression equation. This raises the question as to how much
on the average it can be expected that estimates based on the
regression equation will deviate from the actual values. The
answer is found by the determination of the value of $\sigma_{1.234}$,
which has been found above to be 18.6, or approximately 19.
The standard deviation of the differences between the actual
and the estimated grades in second-semester English is therefore
about 19. If this regression equation and second-order standard
deviation are typical of these college grades and if the differences
between actual and estimated values are in general normally
distributed, the chances are about 95 in 100 that the actual value in a
particular case will be within limits of ±38 (= $2\sigma_{1.234}$) from the
estimated value.

The foregoing conclusion, which is based on the value of
$\sigma_{1.234}$, can be summarized very succinctly by the calculation of
$R_{1.234}$, which has been found to be equal to 0.9056.

If this result is compared with the estimate based on only two
independent variables, it is found that the standard error of
estimate is almost as large for the plane based on three independent
variables as the standard error of estimate based on two
independent variables.¹ In other words, very little increase in
accuracy was obtained by including the fourth variable in
the correlation problem. This same conclusion is borne out
by comparing the coefficients of multiple correlation. Thus
$R_{1.234} = 0.9056$, while $R_{1.23} = 0.9039$, which is nearly as large,
indicating that the trivariate plane was nearly as good a fit as
the four-variable plane of regression.

Partial-correlation Coefficients. The unimportance of knowledge
of grades in College Board English examinations in predicting
the grades of freshmen in second-semester English is
explained also by the small partial-correlation coefficient between
$X_1$ and $X_4$ when $X_2$ and $X_3$ are held constant. This
partial-correlation coefficient is given as $r_{14.23} = 0.1340$.

Analysis of Variance in $X_1$. These conclusions are further
indicated by the nature of the β squares and the β cross-product
terms. From the first equation in Table 62 it is seen that the
various proportions of variance in $X_1$ are accounted for as
follows:


60.25 per cent by correlation with first-semester English grades. 
2.17 per cent by correlation with verbal scholastic-aptitude 
tests. 
0.43 per cent by correlation with College Board English exami- 
nations. 
13.58 per cent by indirect correlation with first-semester English 
grades and verbal scholastic-aptitude tests. 
4.98 per cent by indirect correlation with first-semester English 
grades and College Board English examinations. 
0.61 per cent by indirect correlation with verbal scholastic- 
aptitude tests and College Board English examinations. 
17.98 per cent by correlation with other factors independent of 
first-semester English grades, verbal scholastic-aptitude 
test grades, and College Board English examinations.


¹ Cf. p. 451.

The small percentages attributable to College Board English
examination grades, either directly or indirectly, are apparent
from these statistics. Evidently, under conditions existing at
the woman's college, grades on the College Board English
examination were of little value for predicting how well the
students would do in their college freshman English courses.¹

Another approach to the study of variance in $X_1$ could be
made as follows: It was noted above for three variables that²

$$\sigma_1^2 = r_{12}^2\sigma_1^2 + r_{13.2}^2\sigma_{1.2}^2 + \sigma_{1.23}^2$$

For four variables,

$$\sigma_1^2 = r_{12}^2\sigma_1^2 + r_{13.2}^2\sigma_{1.2}^2 + r_{14.23}^2\sigma_{1.23}^2 + \sigma_{1.234}^2$$

which may be expressed in proportions as follows:

$$1 = r_{12}^2 + r_{13.2}^2\frac{\sigma_{1.2}^2}{\sigma_1^2} + r_{14.23}^2\frac{\sigma_{1.23}^2}{\sigma_1^2} + \frac{\sigma_{1.234}^2}{\sigma_1^2}$$

This expression means that the total variance in $X_1$ is composed
of four parts as follows: the part that is due to total simple
linear correlation with $X_2$, the part that is due to partial correlation
with $X_3$ when $X_2$ is held constant, the part that is due to
partial correlation with $X_4$ when $X_2$ and $X_3$ are held constant,
and the part due to other causes independent of $X_2$, $X_3$, and $X_4$.

The expression $r_{13.2}^2\,\sigma_{1.2}^2/\sigma_1^2$ describes the proportion of the variance
in $X_1$ that is explained as a result of adding $X_3$ to the regression
equation, while $r_{14.23}^2\,\sigma_{1.23}^2/\sigma_1^2$ describes the proportion of the variance
in $X_1$ that is explained as a result of adding $X_4$ to the regression
equation; the influences of $X_3$ and $X_4$ that result from their
association with $X_2$ are already contained in $r_{12}^2$. By substituting
the values of the four above terms in the illustrated
problem, it becomes

$$1.00000 = 0.80238 + 0.07369\,\frac{381.46895}{1,930.3155} + 0.017957\,\frac{353.358}{1,930.3155} + \frac{346.9940}{1,930.3155}$$

or

$$1.00000 = 0.80238 + 0.01456 + 0.00329 + 0.17977$$

¹ It will be noted, however, that $r_{14} = 0.49$, so that approximately 25 per
cent [= (.49)²] of the variation in $X_1$ may be estimated from knowledge of $X_4$
alone.

² Cf. Chap. XVI, Eq. (33), p. 428.

From this expression it may be said that 80.2 per cent of the
variance in $X_1$ is accounted for by total correlation with $X_2$, a
further 1.4 per cent is accounted for by additional correlation
with $X_3$, and a further 0.3 per cent is accounted for by additional
correlation with $X_4$, the remaining 18 per cent being due to other
influences independent of $X_2$, $X_3$, and $X_4$. In other words, by
making a four- instead of a three-variable correlation problem,
that is, by including the College Board English examination
grades, only an additional 0.3 per cent of the variance in
second-semester English grades is explained.
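
The successive shares of variance can be recomputed with the sketch below, which simply substitutes the figures already obtained in this chapter. The variable names are assumptions made for the illustration. The second share is written as the squared partial r times $(1 - r_{12}^2)$, which equals $\sigma_{1.2}^2/\sigma_1^2$.

```python
var_1 = 1930.3155          # sigma_1^2
r_12 = 0.89576             # zero-order r between X1 and X2
r_13_2_sq = 0.07369        # squared first-order partial r of X1 and X3
r_14_23_sq = 0.017957      # squared second-order partial r of X1 and X4
scatter_1_23 = 353.3585    # sigma^2_1.23
scatter_1_234 = 346.9940   # sigma^2_1.234

shares = [
    r_12 ** 2,                               # explained by X2
    r_13_2_sq * (1 - r_12 ** 2),             # added by X3
    r_14_23_sq * scatter_1_23 / var_1,       # added by X4
    scatter_1_234 / var_1,                   # left unexplained
]
total = sum(shares)

for label, share in zip(["X2", "X3 added", "X4 added", "residual"], shares):
    print(f"{label:10s} {100 * share:6.2f} per cent")
print(round(total, 4))
```

Set out in this form, the third line shows directly how little is gained by adding the College Board English grades to the problem.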


CHAPTER XVIII 
NORMAL FREQUENCY SURFACE 


THE BIVARIATE HISTOGRAM 


The study of frequency surfaces begins logically with a geo- 
metrical representation of a bivariate frequency distribution 
known as a “bivariate histogram." To visualize the histogram 
that would represent the distribution of Table 25 (page 326), 
consider an ordinary checkerboard. Let the side and top of 
the board be calibrated with the class-interval scale shown in 
Table 25, and let 81 checkers be taken to represent the 81 
students. On the checkerboard square in the row headed 60— 
and the column headed 120-, let one checker be placed; on the 
square in the row headed 100— and the column headed 60-, let 
two checkers be placed; on the square in the row headed 100- and 
the column headed 100-, let one checker be placed; and so on, 
until all the squares on the checkerboard for which there are 
frequencies in Table 25 are covered with the proper number of 
checkers piled on top of each other. 

If the checkers were square rather than round, they would 
stand up better and fill in all the area, helping to support each 
other. If they were square, the resulting figure would resemble 
a histogram for the given bivariate frequency distribution. A 
picture of what such a histogram would look like is given in 
Fig. 121. 

In the foregoing example the heights of the various piles of
checkers represented the frequency of each cell. It would be
possible, however, so to adjust the vertical scale that the heights
of the piles of checkers represented the relative frequency of
each cell. If the checkers were square, giving a histogram
proper, then, further, it would be possible to adjust the vertical
scale so that the volume of each pile of square checkers measured
the relative frequency. For example, since the class intervals
are 20 units each and the area of any cell is thus 400 square
units, the height of a pile of checkers taken to measure a relative
frequency of, say, 0.08 would be 0.08/400 = 0.0002 unit. This would,
of course, be very small; but then, in any model, the vertical
unit could be taken sufficiently large to offset this. That is,
instead of letting ¼ inch represent 1 unit (the thickness of one
checker, say), it would be possible to let 10,000 inches represent
1 unit. Then 0.0002 unit would be the equivalent of 2 inches, that
is, of a pile of eight checkers.
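
The arithmetic of this example can be retraced with a few lines of Python. This is only a sketch: the function name `block_height` is illustrative, and the quarter-inch checker thickness is an assumption chosen so that the pile comes to eight checkers, as the text states.

```python
def block_height(relative_frequency, interval_x1, interval_x2):
    """Height that makes a block's volume equal to the cell's relative frequency."""
    cell_area = interval_x1 * interval_x2
    return relative_frequency / cell_area

# Class intervals of 20 units each give a cell area of 400 square units.
height_units = block_height(0.08, 20, 20)

inches_per_unit = 10_000        # enlarged vertical scale of the model
checker_thickness_in = 0.25     # assumed thickness of one checker, in inches
checkers = height_units * inches_per_unit / checker_thickness_in

print(height_units, checkers)
```

The same function applies to any cell of the table, so that the volumes of all the blocks together sum to 1.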


Fig. 121.—Histogram representation of a bivariate frequency distribution.
Rectangular blocks on the other side of the mean point are presumably obscured
from view.


Suppose, now, that a histogram is constructed so that volumes
of the square checkers erected on each cell represent the relative
frequency of that cell, and suppose that the number of cases is
indefinitely increased and at the same time the size of the class
intervals is made infinitesimally small. The result would be a
solid figure the top of which would tend to trace out a smooth
surface. This would be a frequency surface. A frequency surface
is thus the limit approached by a bivariate histogram as the
number of cases is indefinitely increased and the sizes of the class
intervals indefinitely reduced. If an area is traced out in the
$X_1X_2$ plane, the relative frequency of cases falling in this area is
given by the volume under the surface over that area.


FREQUENCY SURFACES 


Frequency surfaces may assume all sorts of shapes. They may 
be symmetrical and bell-shaped, or they may be distorted by 
skewness or excessive peakedness or flatness, depending on the 
types of forces underlying the variation in the two variables. 
First will be considered the case of a bivariate surface for variables 
that are normally distributed and are independent of each other. 

Bivariate-surface, Independent Variables. A monovariate 
frequency distribution, it will be recalled, showed the relative 
frequency of occurrence of various values of a given variable. A 
joint, or bivariate, frequency distribution shows the relative 
frequency of occurrence of various pairs of values of the two 
given variables. Suppose, for example, that a marksman is 
shooting at a target. The scatter of dots about the center of 
Fig. 122 may be taken to illustrate the results of a large number 
of such shots. The position of any particular shot relative to 
the center of the target may be indicated by the amount of its 
horizontal deflection (call it $x_2$) and by the amount of its vertical
deflection (call it $x_1$). The relative frequencies of various types
of shots may consequently be indicated by the relative
frequencies of various combinations of horizontal and vertical
deflections, that is to say, of various pairs of values of $x_1$ and $x_2$.

The relative frequency of shots in any given area of the target,
the $x_1x_2$ plane shown in Fig. 122, may be indicated by the density
of shots in that area or by the volume of some frequency surface
constructed over the $x_1x_2$ plane. The use of the surface for this
purpose is illustrated in Fig. 123.

It will be noted that the shots tend to be distributed sym- 
metrically around the center of the target. No tendency for 
large vertical deviations to be associated with large horizontal 
deviations in either a positive or a negative direction is evident. 
Also, no tendency for vertical deviations to vary in any par- 
ticular way with horizontal deviations is apparent. 
Normal Bivariate-surface, Independent Variables. Figure 123
illustrates the normal bivariate surface for independent variables; it
and the example described above illustrate the characteristics
of a bivariate distribution where there is no correlation between
the two variables. This may be summarized as follows: There
is no correlation, that is to say, the variables are independent of
each other, because (1) for any given value of $X_1$, the distribution
of values of $X_2$ is the same, with the same mean and standard
deviation, as for any other value of $X_1$; (2) for any given value
of $X_2$, the distribution of values of $X_1$ is the same, with the same
mean and standard deviation, as for any other value of $X_2$.
When each variable is the result of a set of forces that will produce
a normal frequency distribution in that variable alone and when
the two sets of forces operate independently of each other, the
result will be a normal bivariate frequency distribution with no
correlation. The easiest way of generating a normal bivariate
frequency surface is to suppose that a form of the normal frequency
curve is held in a position perpendicular to the base plane, as in
Fig. 124. A knob is fixed to the top of the frequency curve at
B, and the center of the base line of the frequency curve is
fixed at A, so that it can revolve but so that the line BA always
remains perpendicular to the base plane CD.

Fig. 122.—Distribution of shots at a target, representing a symmetrical bivariate
distribution.

Fig. 123.—A normal bivariate frequency surface, independent variables. [Here
$\sigma_1 = \sigma_2$.]

If the form of this normal frequency curve is revolved in a
complete cycle until it reaches its original position again, the
frequency curve will "describe" the surface of the bivariate
normal frequency surface for independent variables, and it will
be like a system of symmetrically concentric circles such as
that shown in Fig. 123. For such a distribution of pairs of
observations $X_1$ and $X_2$, $r = 0$, for the $x_1x_2$ products are
distributed equally in the four quadrants, minus products canceling
plus products.

Fig. 124.—The normal curve, revolution of which will produce Fig. 123.
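
The canceling of plus and minus products can be watched in a small simulation. The sketch below draws hypothetical "shots" whose vertical and horizontal deflections are independent normal variates; the sample size, the seed, and the helper name `correlation` are illustrative choices and not part of the original discussion.

```python
import math
import random

def correlation(pairs):
    """Product-moment correlation r of a list of (x1, x2) pairs."""
    n = len(pairs)
    mean1 = sum(a for a, _ in pairs) / n
    mean2 = sum(b for _, b in pairs) / n
    s12 = sum((a - mean1) * (b - mean2) for a, b in pairs)
    s11 = sum((a - mean1) ** 2 for a, _ in pairs)
    s22 = sum((b - mean2) ** 2 for _, b in pairs)
    return s12 / math.sqrt(s11 * s22)

rng = random.Random(1)  # fixed seed so that the run is repeatable
# Hypothetical shots: vertical (x1) and horizontal (x2) deflections drawn independently.
shots = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(20000)]

plus_products = sum(a * b for a, b in shots if a * b > 0)    # quadrants I and III
minus_products = sum(a * b for a, b in shots if a * b < 0)   # quadrants II and IV
r = correlation(shots)

print(round(plus_products), round(minus_products), round(r, 3))
```

With a sample of this size the two sums should be nearly equal in magnitude and opposite in sign, and r should lie close to zero.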

Mathematical Representation of Normal Bivariate-surface,
Independent Variables. Use of $\bar{X}_1$ and $\bar{X}_2$ as the Origin. As
noted in the discussion of Fig. 122, the various $X_1X_2$ points
plotted in a bivariate plane may, with no difficulty, be described
in terms of their distances from the respective means. This has
the effect of shifting the axes so that the new axes are the lines
drawn perpendicular to the respective scales at their means; the
vertical line drawn through $\bar{X}_2$ in Fig. 122 is the $x_1$-axis, and the
horizontal line drawn through $\bar{X}_1$ in Fig. 122 is the $x_2$-axis.
Vertical and horizontal deviations from the center of the circle
are $x_1$ and $x_2$ variates. For many purposes it is more convenient
to use this method of describing points in a bivariate plane than
to use the original scales as the point of reference. In the following
pages, the more frequent appearance of $x_1$ and $x_2$, instead of
the capital letters, will be understood to signify the shift from
reference to the original axes to reference to the axes with the
origin at the means of the two variables.

Probability of Each Variate Taken Separately. If $x_1$ is a normally
distributed variate above and below $\bar{X}_1$ and is completely
independent of $x_2$, the probability or relative frequency of
any value of $x_1$ between $x_1$ and $x_1 + dx_1$, whether associated
with large or with small values or with positive or negative
values of $x_2$, will be given by

$$dP(x_1) = \frac{1}{\sigma_1\sqrt{2\pi}}\, e^{-\frac{x_1^2}{2\sigma_1^2}}\, dx_1 \tag{1}$$

Similarly, if $x_2$ is a normally distributed variate and is completely
independent of $x_1$, the probability or relative frequency of any
value of $x_2$ between $x_2$ and $x_2 + dx_2$, whether associated with
large or with small values of $x_1$ or with positive or negative
values of $x_1$, will be given by

$$dP(x_2) = \frac{1}{\sigma_2\sqrt{2\pi}}\, e^{-\frac{x_2^2}{2\sigma_2^2}}\, dx_2 \tag{2}$$

Joint Probability of Two Variables. The joint probability or
joint relative frequency of an $x_1$ between $x_1$ and $x_1 + dx_1$ occurring
in association with an $x_2$ between $x_2$ and $x_2 + dx_2$ is the
product of the above two probabilities. In other words, the
joint probability of the two variables occurring in pairs of any
combination is given by

$$dP(x_1x_2) = \frac{1}{\sigma_1\sqrt{2\pi}}\, e^{-\frac{x_1^2}{2\sigma_1^2}}\, dx_1 \cdot \frac{1}{\sigma_2\sqrt{2\pi}}\, e^{-\frac{x_2^2}{2\sigma_2^2}}\, dx_2 \tag{3}$$

which reduces to the following form:

$$dP(x_1x_2) = \frac{1}{2\pi\sigma_1\sigma_2}\, e^{-\frac{1}{2}\left(\frac{x_1^2}{\sigma_1^2} + \frac{x_2^2}{\sigma_2^2}\right)}\, dx_1\, dx_2 \tag{4}$$

Fig. 125.—A normal bivariate frequency surface, independent variables. [Here
$\sigma_1 > \sigma_2$.]


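Geometrically, the $dP(x_1x_2)$ expressed in Eq. (4) describes the
volume of a column with breadth and width of $dx_1$ and $dx_2$ and a
height equal to
$\frac{1}{2\pi\sigma_1\sigma_2}\, e^{-\frac{1}{2}\left(\frac{x_1^2}{\sigma_1^2} + \frac{x_2^2}{\sigma_2^2}\right)}$.
Such a column is shown at P in Fig. 125.

The normal bivariate surface may be described, therefore,
as follows:

$$f(x_1x_2) = \frac{1}{2\pi\sigma_1\sigma_2}\, e^{-\frac{1}{2}\left(\frac{x_1^2}{\sigma_1^2} + \frac{x_2^2}{\sigma_2^2}\right)} \tag{5}$$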

If the two standard deviations are equal, the normal probability
surface is circular like Fig. 123. Horizontal planes
parallel to the base will intersect the figure in the form of circles
becoming smaller as the plane is elevated. Any vertical plane
parallel with the $x_1$-axis (a line through $\bar{X}_2$) will intersect the
figure in the form of a normal curve with a standard deviation
equal to $\sigma_1$; and any vertical plane parallel with the $x_2$-axis will
intersect the figure in the form of a normal curve with a standard
deviation equal to $\sigma_2$. If the two standard deviations are equal,
these normal curves will be identical.

If, however, the two standard deviations are not equal, the
normal bivariate surface will be elliptical in form, as shown in
Fig. 125, rather than circular. Vertical planes drawn as before
will nevertheless intersect the surface in normal curves. The
vertical normal curves will have standard deviations equal to
$\sigma_1$ and the horizontal normal curves will have standard
deviations equal to $\sigma_2$. Horizontal planes parallel to the base
in Fig. 125 will intersect the figure in the form of ellipses, which
will become smaller as the plane is elevated. Figure 126 is the
sort of ellipse that would be obtained by the intersection of a
plane parallel to the base plane of Fig. 125. The equation for
the ellipse shown in Fig. 126 is

$$0.3x_1^2 + 6.7x_2^2 - 32 = 0$$

or

$$x_1 = \pm\sqrt{106.7 - 22.3x_2^2}$$

Fig. 126.—A horizontal section of the frequency surface of Fig. 125.

Pairs of $x_1$ and $x_2$ that satisfy this equation are:

    x_2        x_1
    0          ±10.3
    ±1         ±9.2
    ±2         ±4.2
    ±2.19      0
Bivariate-surface, Correlated Variables. Instead of two independent
variables, suppose there is a set of paired variables in
which is displayed a marked tendency for positive correlation, so
that large values of $X_1$ are associated with large values of $X_2$,
and vice versa. This is the same as to say that positive values
of $x_1$ occur predominantly with positive values of $x_2$ and negative
values of $x_1$ occur predominantly with negative values of $x_2$, the
small $x$'s measuring in each case the deviations from respective
means. Assume that each distribution taken separately is a
symmetrical one like a in Fig. 127 and a in Fig. 128. In Fig.
127 let a represent the frequency curve of the total distribution
of the $X_2$ variable. Then suppose this frequency distribution of
all the variates of the variable $X_2$ is cross-classified into three
groups, (1) those $X_2$ associated with large values of $X_1$, (2)
those associated with the ordinary or average range of values of
$X_1$, and (3) those associated with small values of $X_1$.

Fig. 127.—Horizontal view of correlated variables.


Fig. 128.—Vertical view of correlated variables.


The plane is accordingly divided vertically into three parts
representing the range of (1) large values of $X_1$ (this part of the
plane is labeled β in Fig. 127); (2) ordinary or average range of
$X_1$ values, represented in the figure by γ; and (3) small values of
$X_1$, represented by δ in the figure.

By summarizing in a group those variates of $X_2$ associated with
large values of $X_1$ (those in the range of β in Fig. 127), and under
the assumption that large values of $X_2$ are associated with large
values of $X_1$, a frequency distribution like b, whose mean would
be larger than the $\bar{X}_2$ of the total population of variable $X_2$,
would be obtained. The line AA' intersects the base of the
frequency curve b at its mean point.

By summarizing in a group the $X_2$ variables in the γ range of
Fig. 127, a frequency distribution of $X_2$ variables like c would be
obtained; then the one showing the $X_2$ variables associated with
$X_1$ in the range of δ would give a frequency distribution like d.
The line AA' in Fig. 127 also passes through the mean of the
frequency curve d. In other words, the means of curves b, c, and
d all lie on the same straight line, AA'.

The $X_1$ variable is treated in a similar manner in Fig. 128, in
which a represents the frequency curve of all of the values of the
$X_1$ variable. This frequency distribution of all the $X_1$ variables
is then cross-classified into three groups, (1) those associated
with small values of $X_2$, (2) those associated with ordinary or
average range of values of $X_2$, and (3) those associated with large
values of $X_2$. The plane of Fig. 128 is accordingly divided
horizontally into three parts, representing the range of (1) small
values of $X_2$ (this part of the plane is labeled β in Fig. 128); (2)
ordinary or average range of $X_2$ values, represented in the figure
by γ; and (3) large values of $X_2$, represented by δ in the figure.
By summarizing in one frequency distribution the variates of $X_1$
associated with small values of $X_2$ (those in the range of β in
Fig. 128), under the assumption that small values of $X_1$ are
associated with small values of $X_2$, a frequency distribution like
b, whose mean is smaller than the mean of the total population
of variable $X_1$, would be obtained.

By summarizing in one group the $X_1$ variables in the range γ
of the $X_2$ variable, a frequency distribution of $X_1$ variables like
c would be obtained; the group of $X_1$ variables associated with
$X_2$ in the range of δ will give a frequency distribution like d.
The line passing through the means of these three frequency
distributions would be like BB' in Fig. 128.
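
The cross-classification just described can be imitated with artificial data. The sketch below generates hypothetical correlated pairs, sorts them into three ranges of $x_1$ (small, ordinary, and large, labeled δ, γ, and β as in Fig. 127), and compares the mean of $x_2$ in each group with a straight line through the origin. The value r = 0.6, the unit standard deviations, the cut points of ±0.5, and the slope $r\sigma_2/\sigma_1$ (the regression slope obtained later in this chapter) are all assumptions made for illustration.

```python
import math
import random

rng = random.Random(7)                 # fixed seed so that the run is repeatable
r, sigma1, sigma2 = 0.6, 1.0, 1.0      # illustrative values only
n = 30000

pairs = []
for _ in range(n):
    z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    x1 = sigma1 * z1
    x2 = sigma2 * (r * z1 + math.sqrt(1.0 - r * r) * z2)   # correlated with x1
    pairs.append((x1, x2))

def group_means(pairs, low, high):
    """Mean x1 and mean x2 of the pairs whose x1 lies in [low, high)."""
    group = [(a, b) for a, b in pairs if low <= a < high]
    m1 = sum(a for a, _ in group) / len(group)
    m2 = sum(b for _, b in group) / len(group)
    return m1, m2

# Three ranges of x1: small (delta), ordinary (gamma), large (beta).
ranges = {"delta": (-math.inf, -0.5), "gamma": (-0.5, 0.5), "beta": (0.5, math.inf)}
means = {name: group_means(pairs, lo, hi) for name, (lo, hi) in ranges.items()}

slope = r * sigma2 / sigma1
for name, (m1, m2) in means.items():
    # Last column: height of the straight line at the group's mean x1.
    print(name, round(m1, 3), round(m2, 3), round(slope * m1, 3))
```

In each group the mean of $x_2$ should fall close to the line, which is the property that AA' displays in Fig. 127.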

Normal Correlation Surface, Correlated Variables. A bivariate
frequency distribution showing the joint variation of two correlated
variables would thus appear to be represented by a
frequency surface that is turned so as to make an angle with the
$x_1$- and $x_2$-axes. A picture of a normal bivariate frequency surface
for correlated variables is shown in Fig. 129. Figures 127
and 128 constitute analyses of the frequencies of Fig. 129 that
divided the surface into three parts, first up and down and second
left and right. The three figures, therefore, are an attempt to
view the same distribution in three different ways. If any cross
section is taken of the surface represented by Fig. 129, parallel to
the $X_2$-axis, the cross section will have the form of a normal
frequency curve with its mean on the line bb'. Any cross section
of this surface taken parallel to the $X_1$-axis will have the form of
a normal frequency curve with its mean on the line aa'. Such
cross sections are similar in character to the frequency curves
b, c, and d, discussed in connection with Figs. 127 and 128,
respectively. Typical cross sections are likewise shown in
Fig. 129.

Fig. 129.—A normal bivariate frequency surface, correlated variables.

Careful study of Figs. 127 to 129 will aid greatly in the
understanding of the theory of correlation. They serve also as the
basis for comprehending the theoretical explanation in the ensuing
section.

Derivation of Equation for Bivariate Normal Frequency Distribution,
Correlated Variables. Equation of a Rotated Ellipse.
A quadratic equation of the general form

$$aX_1^2 + 2hX_1X_2 + bX_2^2 + 2gX_1 + 2fX_2 + c = 0 \tag{6}$$

is an ellipse under the following conditions:¹

$$ab - h^2 > 0 \quad \text{and} \quad D \neq 0$$

where

$$D = \begin{vmatrix} a & h & g \\ h & b & f \\ g & f & c \end{vmatrix} = abc + hgf + gfh - af^2 - ch^2 - bg^2$$

For example, the equation

$$X_1^2 - 4X_1X_2 + 6X_2^2 - 24X_1 + 64X_2 + 144 = 0 \tag{6'}$$

is an ellipse like that shown in Fig. 130, expressed with reference to the
large $X_1X_2$-axes. The center of the ellipse is at $X_1 = 4$, $X_2 = -4$.²

Fig. 130.—A horizontal cross section of a normal bivariate surface, correlated
variables.

The equation for an ellipse with reference to the axes passing through
its center is³

$$a'x_1^2 + 2h'x_1x_2 + b'x_2^2 + c' = 0 \tag{7}$$

where $a' = a$, $h' = h$, $b' = b$, and $c' = D/(ab - h^2)$.

For Eq. (6') the new form is

$$x_1^2 - 4x_1x_2 + 6x_2^2 - 32 = 0 \tag{7'}$$

¹ Fine, H. B., and H. D. Thompson, Coordinate Geometry, pp. 137-138.
² The center of the ellipse is found by solving the following two equations
for $X_1$ and $X_2$:
$$aX_1 + hX_2 + g = 0$$
$$hX_1 + bX_2 + f = 0$$
In this problem, $a = 1$, $h = -2$, $g = -12$, $b = 6$, and $f = 32$.
³ Cf. Fine and Thompson, op. cit.
Following are the solutions for Eqs. (6') and (7'), from which
Fig. 130 was drawn, the two equations describing the same
ellipse:

EQUATION (6')
Solution: $X_1 = 2X_2 + 12 \pm \sqrt{-2(X_2^2 + 8X_2)}$

    X_2      X_1 (computation)      X_1 (values)
     0       12 ± √0                12.0,  12.0
    -1       10 ± √14               13.7,   6.3
    -2        8 ± √24               12.9,   3.1
    -3        6 ± √30               11.5,   0.5
    -4        4 ± √32                9.7,  -1.7
    -5        2 ± √30                7.5,  -3.5

EQUATION (7')
Solution: $x_1 = 2x_2 \pm \sqrt{32 - 2x_2^2}$

    x_2      x_1 (computation)      x_1 (values)
     0        0 ± √32               5.7,  -5.7
    ±1       ±2 ± √30               ±7.5, ∓3.5
    ±2       ±4 ± √24               ±8.9, ∓0.9
    ±3       ±6 ± √14               ±9.7, ±2.3
    ±3.5     ±7 ± √7.5              ±9.7, ±4.3
    ±4       ±8 ± √0                ±8.0


The equations of the axes of the ellipse are obtained by finding
the roots $\lambda$ of the following equation, $\lambda$ being the slope
$x_2/x_1$ of an axis:

$$h'\lambda^2 + (a' - b')\lambda - h' = 0$$

or, in this case,

$$-2\lambda^2 - 5\lambda + 2 = 0$$

$$\lambda = 0.351 \quad \text{or} \quad \lambda = -2.85$$

Here the positive root gives the major axis of the ellipse,
$x_2 = 0.351x_1$, that is, $x_1 = 2.85x_2$, and the negative root gives
the minor axis, $x_2 = -2.85x_1$.

Referred to its own major and minor axes, the equation of the
ellipse is $Ax_1^2 + Bx_2^2 + C = 0$, where A and B are obtained from

$$A + B = a' + b' \qquad AB = a'b' - h'^2 \qquad C = c'$$

and the condition that $A - B$ has the same sign as $h'$. For this
ellipse it is thus found that $A = 0.3$ and $B = 6.7$. The equation
for this ellipse referred to its own axes (see Fig. 126) is

$$0.3x_1^2 + 6.7x_2^2 - 32 = 0$$
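
The several steps of this example (center, constant term of the centered equation, coefficients A and B, and slopes of the axes) can be retraced numerically. The sketch below uses only the coefficients of Eq. (6') quoted in the footnote, together with c = 144; the variable names are illustrative, and the slope is taken as $x_2$ per unit of $x_1$.

```python
import math

a, h, b, g, f, c = 1.0, -2.0, 6.0, -12.0, 32.0, 144.0   # coefficients of Eq. (6')

# Center: solve a*X1 + h*X2 + g = 0 and h*X1 + b*X2 + f = 0 by Cramer's rule.
det = a * b - h * h
center_x1 = (h * f - b * g) / det
center_x2 = (h * g - a * f) / det

# Determinant D and the constant term c' of the centered equation (7).
D = a * b * c + 2.0 * f * g * h - a * f * f - b * g * g - c * h * h
c_prime = D / det

# A and B are the two roots of t^2 - (a + b) t + (ab - h^2) = 0;
# A - B takes the sign of h, so with h < 0 the smaller root is A.
root = math.sqrt((a + b) ** 2 - 4.0 * det)
small, large = (a + b - root) / 2.0, (a + b + root) / 2.0
A, B = (small, large) if h < 0 else (large, small)

# Slopes (x2 per unit of x1) of the two axes: roots of h*L^2 + (a - b)*L - h = 0.
disc = math.sqrt((a - b) ** 2 + 4.0 * h * h)
slopes = sorted(((-(a - b) + disc) / (2.0 * h), (-(a - b) - disc) / (2.0 * h)))

print(center_x1, center_x2, c_prime, round(A, 2), round(B, 2), slopes)
```

The reciprocal of the positive slope is the coefficient 2.85 appearing in the equation of the major axis.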


Mathematical Representation of a Bivariate Normal Correlation
Surface. It was noted above that the bivariate normal surface
in which $x_1$ and $x_2$ are independent of each other (that is, in
which no correlation exists between them) is of the form

$$dP(x_1x_2) = \frac{1}{2\pi\sigma_1\sigma_2}\, e^{-\frac{1}{2}\left(\frac{x_1^2}{\sigma_1^2} + \frac{x_2^2}{\sigma_2^2}\right)}\, dx_1\, dx_2 \tag{8}$$
The constant term, $1/(2\pi\sigma_1\sigma_2)$, is a constant dependent on the
values of the two standard deviations in any particular instance.
The product of this constant times the term
$e^{-\frac{1}{2}\left(\frac{x_1^2}{\sigma_1^2} + \frac{x_2^2}{\sigma_2^2}\right)}$ gives,
for various values of $x_1$ and $x_2$, the height of the bivariate surface
from the base (the distance OP in Fig. 125). If a horizontal plane
parallel with the base plane is drawn through the normal bivariate
surface at a distance OP from the base plane, the intersection of
the plane and the bivariate surface will be an ellipse (as in Fig.
126) if the standard deviations are unequal; the intersection will
be a circle (as in Fig. 123) if the two standard deviations are equal.
Such a plane represents the locus of all points distant OP from the
base plane, and the passing of such a horizontal plane through
the bivariate surface is equivalent to setting the expression

$$\frac{1}{2\pi\sigma_1\sigma_2}\, e^{-\frac{1}{2}\left(\frac{x_1^2}{\sigma_1^2} + \frac{x_2^2}{\sigma_2^2}\right)}$$

equal to a constant, which is equivalent to putting

$$\frac{x_1^2}{\sigma_1^2} + \frac{x_2^2}{\sigma_2^2} = c$$

This equation represents a circle if $\sigma_1 = \sigma_2$ and an ellipse if
$\sigma_1 \neq \sigma_2$.

The smaller the constant c, the smaller will be the circle or
ellipse until, at the peak of the bivariate surface, a very small
circle or ellipse will be found, and finally just a point.
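
The passage from a height OP to the constant c can be made explicit. Setting the height equal to a fraction of the peak height $1/(2\pi\sigma_1\sigma_2)$ and taking logarithms gives $c = -2\log(\text{height}/\text{peak})$, and the ellipse then has semi-axes $\sigma_1\sqrt{c}$ and $\sigma_2\sqrt{c}$. The sketch below assumes illustrative standard deviations of 2 and 1 and a cut at half the peak height; the function names are hypothetical.

```python
import math

def surface_height(x1, x2, sigma1, sigma2):
    """Height of the uncorrelated normal surface above the point (x1, x2)."""
    exponent = -0.5 * (x1 ** 2 / sigma1 ** 2 + x2 ** 2 / sigma2 ** 2)
    return math.exp(exponent) / (2.0 * math.pi * sigma1 * sigma2)

def contour_constant(height, sigma1, sigma2):
    """Constant c in x1^2/sigma1^2 + x2^2/sigma2^2 = c for a cut at the given height."""
    peak = 1.0 / (2.0 * math.pi * sigma1 * sigma2)
    if not 0.0 < height <= peak:
        raise ValueError("height must lie between 0 and the peak of the surface")
    return -2.0 * math.log(height / peak)

def semi_axes(height, sigma1, sigma2):
    """Semi-axes of the ellipse cut out along the x1 and x2 directions."""
    c = contour_constant(height, sigma1, sigma2)
    return sigma1 * math.sqrt(c), sigma2 * math.sqrt(c)

sigma1, sigma2 = 2.0, 1.0                        # illustrative values only
peak = surface_height(0.0, 0.0, sigma1, sigma2)
axis1, axis2 = semi_axes(0.5 * peak, sigma1, sigma2)
print(axis1, axis2, surface_height(axis1, 0.0, sigma1, sigma2) / peak)
```

Raising the cutting plane toward the peak drives c toward zero, so that both semi-axes shrink to nothing.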

If the two variables are correlated, two changes occur. (1) 
The ellipse is rotated. (2) The ellipse is narrowed. If before 
correlation the surface is circular in form, owing to the fact that 
the standard deviations are equal, the existence of correlation 
will cause the circle to be converted into a rotated ellipse, narrow- 
ing the circle to an elliptical form. If before correlation the 
surface is elliptical in form, owing to the fact that the standard 
deviations are unequal (see Fig. 126), the existence of correlation 
will cause the ellipse to rotate and also to become narrower. 
This phenomenon is explained as follows: 

If larger than average values of $X_1$ cause $X_2$ to be larger than
average and smaller than average values of $X_1$ cause $X_2$ to be
smaller than average, the pull exerted on $X_2$ values is indicated
by the arrows in Fig. 131. The larger the $X_1$, the more pull
will be exercised upon $X_2$ to make it larger than its average.
This is indicated by making arrow (1) longer than arrows (2),
(3), and (4), which, respectively, represent the degree to which
successively smaller values of $X_1$ affect values of $X_2$, until, by
the time $X_1$ becomes smaller than average (below the line $\bar{X}_1$),
arrow (4') points to the negative pull, that is, causing $X_2$ to be
less than its average.

When correlation exists, this means that bivariate frequencies
located in quadrant II, where $X_1$ is larger and $X_2$ is smaller
than average, tend to move over to quadrant I, where $X_1$ and
$X_2$ are both larger than average. Bivariate frequencies already
located in quadrant I are less affected. Similarly, bivariate
frequencies in quadrant IV tend to move to quadrant III, where
both $X_1$ and $X_2$ are smaller than average, while bivariate
frequencies in quadrant III are less affected. The result is that
the rotated ellipse becomes narrowed as shown in the part of
Fig. 131 at the right. Any horizontal plane parallel to the base
of a correlated bivariate surface (Fig. 129) will intersect the bivariate
frequency surface in the form of an ellipse such as that shown in
the right half of Fig. 131: large ellipses near the base plane, and
smaller and smaller ellipses as the horizontal plane is raised
higher and higher from the base. These ellipses have the
equation

$$ax_1^2 + 2hx_1x_2 + bx_2^2 + c = 0$$

Fig. 131.—Illustrating the difference between the nonexistence and existence of
correlation in a normal bivariate frequency surface.
As already noted, the middle term $2hx_1x_2$ is present in the
equation because of the fact that the ellipse is rotated and now
described in terms of axes other than its own, although the origin
remains the center of the ellipse. The middle term is thus
present because of correlation, which causes the rotation of the
ellipse. This middle term is generally called the "product term"
because it is the product of the two variables. When there is
no correlation, this middle term disappears. The narrowing of
the ellipse, as will be seen, results in the increase in the value of
the constant term $1/(2\pi\sigma_1\sigma_2)$.

Since the normal bivariate surface in which $X_1$ and $X_2$ are
correlated is thus elliptical in form but rotated and narrower than
the elliptical surface representing uncorrelated bivariates, the
distribution of probabilities or relative frequencies will be given
by an expression of the form

$$dP(x_1x_2) = k\, e^{-(ax_1^2 + 2hx_1x_2 + bx_2^2)}\, dx_1\, dx_2 \tag{9}$$

This is the general formula for a normal bivariate frequency
distribution of correlated variables. The remainder of the
argument, which appears in the Appendix to this chapter, shows
how the parameters k, a, h, and b may be evaluated in terms of
the moments of $X_1$ and $X_2$.² When the proper values of the
parameters are inserted, the formula is as follows:

$$dP(x_1x_2) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1 - r^2}}\, e^{-\frac{1}{2(1 - r^2)}\left(\frac{x_1^2}{\sigma_1^2} - \frac{2rx_1x_2}{\sigma_1\sigma_2} + \frac{x_2^2}{\sigma_2^2}\right)}\, dx_1\, dx_2 \tag{10}$$

¹ See p. 475.
² See Appendix, pp. 492-496.

This probability expression describes a normal bivariate
frequency distribution such as that graphed in Fig. 129. The
rotated position is reflected in the fact that the exponent of e
has a middle "product term." The fact that the surface is
narrower than it would be if there were no correlation is reflected
in the character of the constant term, which is larger than the
constant term of a normal bivariate frequency surface of
uncorrelated variables.¹ In other words, because r cannot be greater
than 1,

$$\frac{1}{2\pi\sigma_1\sigma_2\sqrt{1 - r^2}} \geq \frac{1}{2\pi\sigma_1\sigma_2}$$

The degree to which the constant term in the correlated surface
is larger depends upon the value of r. If r = 0, the constant
term becomes identical with the constant term of the uncorrelated
surface. If r = 1, the constant term of the correlated
surface becomes infinitely large, reflecting the fact that when
r = 1 the surface becomes so narrowed that it is a plane, all
points being on the line of regression.
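
The behavior of the constant term can be tabulated directly from Eq. (10). In the sketch below the function name, the unit standard deviations, and the particular values of r are illustrative assumptions. The ratio of the peak of the correlated surface to the peak of the uncorrelated surface is $1/\sqrt{1 - r^2}$, while the total volume under the surface stays close to 1, which is the "stretching upward of a given volume."

```python
import math

def correlated_height(x1, x2, sigma1, sigma2, r):
    """Height of the normal surface of Eq. (10); requires -1 < r < 1."""
    q = (x1 ** 2 / sigma1 ** 2
         - 2.0 * r * x1 * x2 / (sigma1 * sigma2)
         + x2 ** 2 / sigma2 ** 2) / (1.0 - r * r)
    return math.exp(-0.5 * q) / (2.0 * math.pi * sigma1 * sigma2 * math.sqrt(1.0 - r * r))

# Ratio of the peak height to the peak height of the uncorrelated surface.
uncorrelated_peak = correlated_height(0.0, 0.0, 1.0, 1.0, 0.0)
ratios = {r: correlated_height(0.0, 0.0, 1.0, 1.0, r) / uncorrelated_peak
          for r in (0.0, 0.5, 0.9, 0.99)}

# The volume under the surface remains 1 however large the peak becomes.
step = 0.1
grid = [i * step for i in range(-60, 61)]
volume = sum(correlated_height(a, b, 1.0, 1.0, 0.6) for a in grid for b in grid) * step * step

for r, ratio in ratios.items():
    print(r, round(ratio, 3))
print(round(volume, 4))
```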


LINES OF REGRESSION 


In the discussion of Fig. 127 it was pointed out that the line
AA' passes through the means of frequency distributions b, c,
and d. Similarly, in the discussion of Fig. 128, it was said that
the line BB' passes through the means of frequency distributions
b, c, and d. In the discussion of Fig. 129 the line aa' was said to
pass through the means of any frequency distribution made by a
vertical plane parallel with the $x_1$-axis, and the line bb' was said
to pass through the means of any frequency distribution made
by a vertical plane parallel with the $x_2$-axis. These two lines
are thus the progressions of the means for the normal bivariate
surface. As will be shown shortly, they are also the least-squares
lines that might be fitted to the surface. In both senses,
therefore, they are the lines of regression for the surface.

If there is no correlation, as illustrated by Figs. 122, 123, and
126, the two lines of regression correspond with the major and
minor axes of the ellipse, that is, with the axes represented by
the $\bar{X}_1$ and $\bar{X}_2$ lines of Fig. 122 or Fig. 126. By hypothesis, in
the uncorrelated bivariate surface the mean of any frequency
distribution made by a vertical plane parallel to the $x_1$-axis will
be on the $\bar{X}_1$ line, and the mean of any frequency distribution
made by a vertical plane parallel to the $x_2$-axis will be on the $\bar{X}_2$
line. When the surface is rotated and narrowed, as a result of
correlation, it is part of the hypothesis that the normal symmetry
of the surface remains and accordingly the means remain in a
straight line, but a straight line at an acute angle rather than
perpendicular to the original axis.

¹ The narrowing is due to a stretching upward of a given volume. As
indicated, in the limiting situation (r = 1), the surface becomes a vertical
plane stretching to an infinite height and having an infinitesimal thickness.

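Mathematical Representation of Lines of Regression. The
bivariate normal correlation surface in terms of probabilities
has been found to be described as follows:

$$dP(x_1x_2) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1 - r^2}}\, e^{-\frac{1}{2(1 - r^2)}\left(\frac{x_1^2}{\sigma_1^2} - \frac{2rx_1x_2}{\sigma_1\sigma_2} + \frac{x_2^2}{\sigma_2^2}\right)}\, dx_1\, dx_2 \tag{11}$$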

A line of regression, for example, the line of regression of $x_2$ on
$x_1$, is a general description of the law of relationship by which
for a given value of $x_1$ the most probable value of $x_2$ may be
determined. Equation (11) describes the joint probability of any
bivariate $x_1x_2$. The probability of any value of $x_2$ occurring
with some specified value of $x_1$, say $x_1'$, will be as follows:

$$dP(x_1'x_2) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1 - r^2}}\, e^{-\frac{1}{2(1 - r^2)}\left(\frac{x_1'^2}{\sigma_1^2} - \frac{2rx_1'x_2}{\sigma_1\sigma_2} + \frac{x_2^2}{\sigma_2^2}\right)}\, dx_1\, dx_2 \tag{12}$$

If the $(x_1')^2$ term is factored from the exponent of e, the equation
becomes

$$dP(x_1'x_2) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1 - r^2}}\, e^{-\frac{x_1'^2}{2(1 - r^2)\sigma_1^2}}\, e^{-\frac{1}{2(1 - r^2)}\left(\frac{x_2^2}{\sigma_2^2} - \frac{2rx_1'x_2}{\sigma_1\sigma_2}\right)}\, dx_1\, dx_2 \tag{13}$$

The square of $\frac{x_2^2}{\sigma_2^2} - \frac{2rx_1'x_2}{\sigma_1\sigma_2}$ is completed by adding
$\frac{r^2x_1'^2}{\sigma_1^2}$, which must also be subtracted to keep the value of the whole
expression unchanged. This subtracted part may be conveniently
put with the other $(x_1')^2$ term so that the final result of
these operations is as follows:

$$dP(x_1'x_2) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1 - r^2}}\, e^{-\frac{(1 - r^2)x_1'^2}{2(1 - r^2)\sigma_1^2}}\, e^{-\frac{1}{2(1 - r^2)}\left(\frac{x_2}{\sigma_2} - \frac{rx_1'}{\sigma_1}\right)^2}\, dx_1\, dx_2 \tag{14}$$

Upon simplifying the exponents and splitting up the constant
term and the $dx_1\, dx_2$, this expression becomes

$$dP(x_1'x_2) = \left[\frac{1}{\sigma_1\sqrt{2\pi}}\, e^{-\frac{x_1'^2}{2\sigma_1^2}}\, dx_1\right] \cdot \frac{1}{\sigma_2\sqrt{1 - r^2}\,\sqrt{2\pi}}\, e^{-\frac{\left(x_2 - r\frac{\sigma_2}{\sigma_1}x_1'\right)^2}{2\sigma_2^2(1 - r^2)}}\, dx_2 \tag{15}$$

Since the first factor is a constant ($x_1'$ being given), Eq. (15)
shows that the probability of an $x_2$ for a given value of $x_1$ is
proportional to the probability of a normally distributed variate
whose mean is $r\frac{\sigma_2}{\sigma_1}x_1'$ and whose standard deviation is
$\sigma_2\sqrt{1 - r^2}$. (It will be recalled that the general equation for the
normal curve is $\frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{x^2}{2\sigma^2}}$.) Accordingly, the most probable value
of $x_2$ for specified values of $x_1$, that is, the line of regression of
$x_2$ on $x_1$, is as follows:

$$x_2 = r\frac{\sigma_2}{\sigma_1}x_1$$

The standard deviation or scatter about this line is $\sigma_2\sqrt{1 - r^2}$.
From Eq. (15) it is seen that the locus of all points representing
the means of $x_2$ for a given $x_1'$ is $x_2 = r\frac{\sigma_2}{\sigma_1}x_1'$, which is the
equation of the line of regression of $x_2$ on $x_1$. The line of regression of
$x_1$ on $x_2$ is given by interchanging $x_1$ and $x_2$ in the above argument.
As indicated above, these two lines are the same as
those that might be fitted to the distribution by the method of
least squares. From Eq. (15) it is also shown that the standard
deviation of $x_2$ for a given $x_1$ (in other words, the scatter at any
point of the line of regression) is independent of the selected
value of $x_1$, for it is always equal to $\sigma_2\sqrt{1 - r^2}$.
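
The factoring carried out in Eqs. (12) to (15) can be verified numerically. The sketch below compares the height of the correlated surface at a given $x_1$ with the product of the two factors of Eq. (15): the ordinate of $x_1$ taken alone, and a normal ordinate in $x_2$ centered on the line of regression with standard deviation $\sigma_2\sqrt{1 - r^2}$. The standard deviations (2 and 3), the value r = 0.7, and the function names are illustrative assumptions.

```python
import math

def normal_pdf(x, mean, sd):
    """Ordinate of a normal curve with the given mean and standard deviation."""
    return math.exp(-((x - mean) ** 2) / (2.0 * sd * sd)) / (sd * math.sqrt(2.0 * math.pi))

def joint_height(x1, x2, sigma1, sigma2, r):
    """Height of the correlated normal surface, as in Eq. (11)."""
    q = (x1 ** 2 / sigma1 ** 2
         - 2.0 * r * x1 * x2 / (sigma1 * sigma2)
         + x2 ** 2 / sigma2 ** 2) / (1.0 - r * r)
    return math.exp(-0.5 * q) / (2.0 * math.pi * sigma1 * sigma2 * math.sqrt(1.0 - r * r))

def regression_mean(x1, sigma1, sigma2, r):
    """Point on the line of regression of x2 on x1."""
    return r * sigma2 / sigma1 * x1

def scatter(sigma2, r):
    """Standard deviation of x2 about the line of regression."""
    return sigma2 * math.sqrt(1.0 - r * r)

sigma1, sigma2, r = 2.0, 3.0, 0.7     # illustrative values only
x1_given = 1.0
differences = []
for x2 in (-2.0, 0.0, 1.05, 4.0):
    left = joint_height(x1_given, x2, sigma1, sigma2, r)
    right = (normal_pdf(x1_given, 0.0, sigma1)
             * normal_pdf(x2, regression_mean(x1_given, sigma1, sigma2, r), scatter(sigma2, r)))
    differences.append(abs(left - right))

print(regression_mean(x1_given, sigma1, sigma2, r), scatter(sigma2, r), max(differences))
```

Because the second factor is a normal curve in $x_2$, the surface along the section at the given $x_1$ is highest where $x_2$ falls on the line of regression.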


NORMAL MULTIVARIATE FREQUENCY “SURFACE” 


When a bivariate distribution is described in geometrical 
terms, one of the dimensions can be used to measure the fre- 
quencies. This is not possible for distributions involving more 
than two variables. In the three-variable case, for example, all 
three dimensions must be used to indicate the variations in the 
variables themselves, and none is left to measure the frequencies. 

Resort is had in multivariate problems to the use of densities
to measure frequencies. Such a device could have been used in
the monovariate or bivariate case; instead of having the frequency
of any interval represented by the height of a rectangle
erected on the interval, it could be assumed that the cases were
represented by points on a line, and the more points crowded into
any given interval on the line, i.e., the greater the density of
points in the interval, the greater would be the frequency of
that interval. Likewise, in the bivariate case, instead of representing
the frequency of cases in any given two-dimensional cell
by the height of a rectangular pile of checkers set up on that cell,
it would be possible to look upon the various cases as points in
the two-dimensional plane; the frequency of points in any cell
would then become the density of points in that cell.

This is the device used to measure frequencies in the multi- 
variate case. For a trivariate distribution, for example, the 
various cases are looked upon as points in three-dimensional 
space, and the density of these points in any given three-dimen- 
sional cell becomes the measure of the relative frequency of cases 
in that cell. A trivariate frequency “surface,” if it may be so 
called, is in reality a trivariate density function. The same idea 
may be carried over by analogy to distributions of four or more 
variables, although no graphical representation can actually be 
made of such distributions. 

The properties of a normal multivariate “surface” or density 
function are merely generalizations of the properties of a normal 
bivariate surface. Whereas in the latter case, loci of equi- 
probability (i.e., loci of constant level on the frequency surface) 
were ellipses in the $x_1x_2$-plane, in the multivariate case loci of
equiprobability (i.e., loci of equal density in the N-dimensional
space) are ellipsoids in the $X_1, X_2, \ldots, X_N$ space. A picture
of a three-dimensional ellipsoid is given in Fig. 132. This repre- 
sents a contour of equiprobability for a trivariate normal distri- 
bution in which there is no correlation. Similar ellipsoids, some 
larger, some smaller, would represent other contours of equi- 
probability, and the whole distribution could be represented by a 
nest of such ellipsoids. The elliptical contours representing a 
high degree of probability are, of course, the contours close to 
the center, the center itself being the point of maximum prob- 
ability (maximum density). As one goes off from the center 
in a straight line in any direction whatsoever, the change in 
probability (density) is in accordance with the normal law. If 
the variables are measured in standard-deviation units, the
ellipsoids become spheres and the distribution becomes symmetrical
in all directions.

When there is correlation between the variables, the ellipsoids 
of equiprobability become tilted with respect to the various axes
and flattened out. If the variables are measured in standard- 
deviation units, the degree of tilting in any direction is directly 
related to the amount of the correlation between the variables 
concerned. The greater the multiple correlation between the 
variables, the narrower or flatter the ellipsoids become. In
the limit in which there is perfect correlation between all the
variables, the whole distribution reduces to a line through
the origin at an angle of approximately 54¾ degrees ($\cos^{-1}(1/\sqrt{3})$)
with all the axes (assuming the variables are measured in $\sigma$ units).

Fig. 132.—Ellipsoid of equiprobability for a trivariate normal frequency
distribution.
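The limiting angle for the three-variable case can be checked numerically. The sketch below, in Python, assumes that with N perfectly correlated variables in standard-deviation units the line lies along the direction making equal angles with all N axes, so that the cosine of the angle is the reciprocal of the square root of N; the function name is hypothetical.

```python
import math


def angle_with_axes_degrees(n_variables: int) -> float:
    """Angle, in degrees, between each axis and the line that makes
    equal angles with all n_variables axes."""
    if n_variables < 1:
        raise ValueError("at least one variable is required")
    return math.degrees(math.acos(1.0 / math.sqrt(n_variables)))


print(round(angle_with_axes_degrees(3), 2))  # three variables
print(round(angle_with_axes_degrees(2), 2))  # bivariate case, for comparison
```

For two variables the same rule gives the familiar 45-degree line.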

As in the simpler case, a plane or hyperplane of regression 
represents the locus of the mean values of one variable for various
combinations of the other variables. For a normal multivariate 
distribution, the deviations from any plane of regression are all 
normally distributed with a constant standard deviation for any 
one plane. 

All the properties of a normal bivariate distribution thus carry 
over to a normal multivariate distribution, the only difference 
being that ellipses of equiprobability and lines of regression now
become ellipsoids and hyperplanes of higher dimensions. Basically,
the character of the distribution is essentially the same.


NONNORMAL BIVARIATES AND MULTIVARIATES 


If a bivariate or multivariate distribution does not approach
the normal form, much of the conventional correlation analysis
loses its significance. In some cases, by taking logarithms or 
reciprocals a nonnormal distribution may be transformed into a 
normal distribution. In some instances, a multivariate dis- 
tribution may be normal with respect to its variations about the 
means of the rows and columns but the means of the rows or 
means of columns may trace out a curve of regression. In other 
instances, the regressions of the means may be linear, or planar, 
but the deviations around these lines, or planes, of regression 
may be either nonnormally distributed or normally distributed 
with varying standard deviations. 
If, in the case of two variables, the regressions are linear, the
initial arguments presented for the use of the product-moment
formula for r are still valid even for nonnormal distributions.*
Large values of $X_1$ would still tend in general to be associated
with large values of $X_2$ (or with small values if the correlation is
negative), and a formula based upon the product deviations
would give a good measure of the association between the two
variables. If the distribution of cases around the lines of regression
is skewed, however, or if the standard deviation varies from
one part of the line to another, the scatter about the lines of
regression loses its significance as a measure of typical variability.
Great care must be taken in these cases in using an average
scatter to determine the degree of error in an estimate based on
the line of regression. When the distributions are not normal,
the rule that two-thirds of the cases tend to lie between plus and
minus $\sigma_{1.2}$ no longer holds.

Finally, if the bivariate distribution is not normal, even the 
product-moment formula may cease to be a statistic of special 
significance in characterizing the distribution. In the normal 
case, if the two means, the two standard deviations, and r are 
all known, the bivariate distribution is fully determined. In the 
nonnormal case, other measures similar to measures of skewness
and kurtosis in the monovariate case may be of equal if not
greater importance in describing the bivariate distribution.
These considerations should always be borne in mind when r is
used to measure correlation between nonnormally distributed
bivariates.

1 See Chap. XV, pp. 377-396.
* See Chap. XIII, pp. 338-353.

Similar statements may also be made about nonnormal multi- 
variate distributions. Here the higher dimensionality multiplies 
the possibilities of skewness, kurtosis, and other departures from 
normality.¹


APPENDIX 


DERIVATION OF THE EQUATION FOR THE NORMAL BIVARIATE 
FREQUENCY SURFACE, CORRELATED VARIABLES 


The normal bivariate surface in which $X_1$ and $X_2$ are correlated is elliptical
in form but rotated and narrower than the elliptical surface representing
uncorrelated bivariates. The distribution of the probabilities or
relative frequencies is given by an expression of the following form:

$$dP(x_1x_2) = k\,e^{-\frac{1}{2}\left(a'x_1^2 + 2h'x_1x_2 + b'x_2^2\right)}\,dx_1\,dx_2 \tag{16}$$

in which the constants $k$, $a'$, $h'$, and $b'$ may be evaluated in terms of the
moments of $X_1$ and $X_2$.
First it is to be noted that 


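$$\begin{aligned}
\iint P(x_1x_2)\,dx_1\,dx_2 &= 1 &&\text{(i)}\\
\iint P(x_1x_2)\,x_1\,dx_1\,dx_2 &= 0 &&\text{(ii)}\\
\iint P(x_1x_2)\,x_2\,dx_1\,dx_2 &= 0 &&\text{(iii)}\\
\iint P(x_1x_2)\,x_1^2\,dx_1\,dx_2 &= \sigma_1^2 &&\text{(iv)}\\
\iint P(x_1x_2)\,x_2^2\,dx_1\,dx_2 &= \sigma_2^2 &&\text{(v)}\\
\iint P(x_1x_2)\,x_1x_2\,dx_1\,dx_2 &= r\sigma_1\sigma_2 &&\text{(vi)}
\end{aligned}$$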


Equation (i) is true since the sum of all probabilities or relative frequencies
is necessarily one. Equations (ii) and (iii) are true because $x_1$ and $x_2$
represent deviations from the means of $X_1$ and $X_2$. Thus
$\iint P(x_1x_2)\,x_1\,dx_1\,dx_2$ is equivalent to $\sum x_1/N$, which equals zero.
Likewise,

$$\iint P(x_1x_2)\,x_2\,dx_1\,dx_2 = \frac{\sum x_2}{N} = 0$$

Equation (iv) is another form of $\sum x_1^2/N$, which is equal to the variance of
$X_1$; Eq. (v) is equivalent to $\sum x_2^2/N$, which is equal to the variance of $X_2$;
and Eq. (vi) is equivalent to $\sum x_1x_2/N$, which is equal to $r\sigma_1\sigma_2$, since

$$r = \frac{\sum x_1x_2}{N\sigma_1\sigma_2}$$


1 For a more complete consideration of the problem of nonnormality,
see Smith and Duncan, Sampling Statistics, Chap. 18.

Second it is to be noted that, with reference to its rotated axes, aa' and
bb', the equation of the ellipse representing the intersection of the frequency
surface by a horizontal plane at a distance $ke^{-C/2}$ from the base plane
is as follows:

$$Ax_1'^2 + Bx_2'^2 = C$$

where $x_1'$ and $x_2'$ represent the coordinates of a point with reference to the
axes aa' and bb', that is to say, $x_1'$ measures the perpendicular distance of a
point from aa' and $x_2'$ measures the perpendicular distance of a point from
bb'. If the areal element¹ $dx_1\,dx_2$ is also expressed in terms of the $x_1'x_2'$
coordinates, it becomes $dx_1\,dx_2 = dx_1'\,dx_2'$.* The whole probability function
thus becomes

$$dP(x_1x_2) = k\,e^{-\frac{1}{2}\left(Ax_1'^2 + Bx_2'^2\right)}\,dx_1'\,dx_2' \tag{17}$$


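But this is the form of a normal frequency surface for uncorrelated variables,
so that, as seen above, pages 474-475,

$$A = \left(\frac{1}{\sigma_1'}\right)^2, \qquad B = \left(\frac{1}{\sigma_2'}\right)^2, \qquad \text{and} \qquad k = \frac{1}{2\pi\sigma_1'\sigma_2'}$$

Since there is no cross-product term, $H = 0$.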


1 It will be recalled that $dP(x_1x_2) = F(x_1x_2)\,dx_1\,dx_2$ is represented
geometrically by the volume under the surface $F(x_1x_2)$ cut off by a hollow pipe,
erected on a rectangle in the $x_1x_2$ plane, the sides of which are $dx_1$ and $dx_2$
(see Fig. 125, p. 475). To express the whole probability distribution in
terms of the new $x_1'x_2'$ coordinates, the area of the pipe's base, $dx_1\,dx_2$, must
be transformed into these new coordinates as well as the height of the pipe,
$F(x_1x_2)$.

* The transformation of coordinates is of the form

$$x_1 = x_2'\sin\alpha + x_1'\cos\alpha$$
$$x_2 = x_2'\cos\alpha - x_1'\sin\alpha$$

where $\alpha$ is the angle that aa' makes with the $x_2$-axis. Cf. Fine and Thompson,
Coordinate Geometry, p. 120. Since, in general, $dx_1\,dx_2$ equals, within
differentials of higher order,

$$\begin{vmatrix} \dfrac{\partial x_1}{\partial x_1'} & \dfrac{\partial x_1}{\partial x_2'} \\[2ex] \dfrac{\partial x_2}{\partial x_1'} & \dfrac{\partial x_2}{\partial x_2'} \end{vmatrix}\,dx_1'\,dx_2'$$

it follows that

$$dx_1\,dx_2 = \begin{vmatrix} \cos\alpha & \sin\alpha \\ -\sin\alpha & \cos\alpha \end{vmatrix}\,dx_1'\,dx_2' = dx_1'\,dx_2'$$

since $\cos^2\alpha + \sin^2\alpha = 1$. Cf. Wilson, E. B., Advanced Calculus, pp.
133-134.

The distribution function, Eq. (17), may therefore be written as follows:

$$dP(x_1x_2) = \frac{1}{2\pi\sigma_1'\sigma_2'}\,e^{-\frac{1}{2}\left(\frac{x_1'^2}{\sigma_1'^2} + \frac{x_2'^2}{\sigma_2'^2}\right)}\,dx_1'\,dx_2' \tag{18}$$

where $\sigma_1'$ and $\sigma_2'$ are the standard deviations of the new variables $x_1'$ and $x_2'$.
It will be noted that this transformation has not changed the probability
of a given $x_1x_2$ combination but has merely expressed it in terms of a new
set of coordinates. Accordingly, $P(x_1x_2) = P(x_1'x_2')$, where $x_1'$ and $x_2'$ are
derived by a linear transformation from $x_1$ and $x_2$.*

Finally, it will be noted that in any equation of the second degree the
product of the coefficients of the squared terms minus the square of one-half
the coefficient of the cross-product term is invariant (that is, its value
remains unchanged) under simple translations and rotations.¹ Accordingly,
the following relationships hold:

$$AB - H^2 = a'b' - (h')^2$$

or since $H = 0$,

$$AB = a'b' - (h')^2$$

But inasmuch as $A = (1/\sigma_1')^2$, $B = (1/\sigma_2')^2$, it follows that

$$AB = \frac{1}{\sigma_1'^2\sigma_2'^2} = a'b' - (h')^2$$

From this it follows that

$$k = \frac{1}{2\pi\sigma_1'\sigma_2'} = \frac{\sqrt{a'b' - (h')^2}}{2\pi}$$

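Use will now be made of these relationships to derive the values of $a'$, $b'$,
and $h'$.
As noted above, since

$$\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} k\,e^{-\frac{1}{2}\left(a'x_1^2 + 2h'x_1x_2 + b'x_2^2\right)}\,dx_1\,dx_2 = 1$$

then

$$\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} e^{-\frac{1}{2}\left(a'x_1^2 + 2h'x_1x_2 + b'x_2^2\right)}\,dx_1\,dx_2 = \frac{2\pi}{\sqrt{a'b' - (h')^2}} \tag{19}$$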


* See footnote *, p. 493.
1 Cf. Fine and Thompson, op. cit., p. 131.

If both sides of Eq. (19) are differentiated with respect to $a'$, it is found
that

$$-\frac{1}{2}\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} x_1^2\,e^{-\frac{1}{2}\left(a'x_1^2 + 2h'x_1x_2 + b'x_2^2\right)}\,dx_1\,dx_2 = -\frac{1}{2}\cdot\frac{2\pi b'}{\left[a'b' - (h')^2\right]^{3/2}}$$


By canceling out $-\frac{1}{2}$ and multiplying the equation by $k$, the left side is
equal to $\sigma_1^2$ [see Eq. (iv), page 492], and it is found that

$$\sigma_1^2 = \frac{b'}{a'b' - (h')^2} \qquad \text{or} \qquad b' = \sigma_1^2\left[a'b' - (h')^2\right] \tag{a}$$
If similar procedure is followed after differentiating Eq. (19) with respect
to $b'$, it is found that

$$\sigma_2^2 = \frac{a'}{a'b' - (h')^2} \qquad \text{or} \qquad a' = \sigma_2^2\left[a'b' - (h')^2\right] \tag{b}$$
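If both sides of Eq. (19) are differentiated with respect to $h'$, it is found that

$$-\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} x_1x_2\,e^{-\frac{1}{2}\left(a'x_1^2 + 2h'x_1x_2 + b'x_2^2\right)}\,dx_1\,dx_2 = -\frac{1}{2}\cdot\frac{2\pi(-2h')}{\left[a'b' - (h')^2\right]^{3/2}} = \frac{2\pi h'}{\left[a'b' - (h')^2\right]^{3/2}}$$

in which, if multiplied through by $k$, the left side equals $-r\sigma_1\sigma_2$ [see Eq.
(vi)], and hence the whole expression reduces to

$$-r\sigma_1\sigma_2 = \frac{h'}{a'b' - (h')^2} \qquad \text{or} \qquad h' = -r\sigma_1\sigma_2\left[a'b' - (h')^2\right] \tag{c}$$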


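From Eqs. (a), (b), and (c), it follows that

$$b' = \frac{a'\sigma_1^2}{\sigma_2^2}, \qquad h' = -\frac{a'r\sigma_1}{\sigma_2}, \qquad \sqrt{a'b' - (h')^2} = \frac{a'\sigma_1\sqrt{1 - r^2}}{\sigma_2} \tag{d}$$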

Equations (a), (b), and (c) are three equations from which the values of
$a'$, $b'$, and $h'$ may be expressed in terms of $\sigma_1$, $\sigma_2$, and $r$. The direct
evaluation of $a'$, $b'$, and $h'$ from these equations is not a simple matter, however,
and it is easier to proceed as follows: From Eqs. (a), (b), and (c), it is possible
to express $b'$ and $h'$ in terms of $a'$, as noted above in Eq. (d). It will
also be recalled that

$$k = \frac{\sqrt{a'b' - (h')^2}}{2\pi} = \frac{a'\sigma_1\sqrt{1 - r^2}}{2\pi\sigma_2} \tag{e}$$

By substituting equivalent values, Eq. (16) may accordingly be written
as follows:

$$dP(x_1x_2) = \frac{a'\sigma_1\sqrt{1 - r^2}}{2\pi\sigma_2}\,e^{-\frac{a'\sigma_1^2}{2}\left(\frac{x_1^2}{\sigma_1^2} - \frac{2rx_1x_2}{\sigma_1\sigma_2} + \frac{x_2^2}{\sigma_2^2}\right)}\,dx_1\,dx_2 \tag{20}$$


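The double sum $\iint dP(x_1x_2) = 1$, however, so that, from Eq. (20), it
follows that

$$\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} \frac{a'\sigma_1\sqrt{1 - r^2}}{2\pi\sigma_2}\,e^{-\frac{a'\sigma_1^2}{2}\left(\frac{x_1^2}{\sigma_1^2} - \frac{2rx_1x_2}{\sigma_1\sigma_2} + \frac{x_2^2}{\sigma_2^2}\right)}\,dx_1\,dx_2 = 1 \tag{21}$$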


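If both sides of Eq. (21) are differentiated with respect to $a'$, it is found
that

$$-\frac{\sigma_1^2}{2}\iint \left(\frac{x_1^2}{\sigma_1^2} - \frac{2rx_1x_2}{\sigma_1\sigma_2} + \frac{x_2^2}{\sigma_2^2}\right)\frac{a'\sigma_1\sqrt{1 - r^2}}{2\pi\sigma_2}\,e^{-\frac{a'\sigma_1^2}{2}\left(\frac{x_1^2}{\sigma_1^2} - \frac{2rx_1x_2}{\sigma_1\sigma_2} + \frac{x_2^2}{\sigma_2^2}\right)}\,dx_1\,dx_2 = -\frac{1}{a'}$$

Multiplication of both sides by $-a'$ and expansion of terms then give the
following:

$$\begin{aligned}
&\frac{a'}{2}\iint x_1^2\,\frac{a'\sigma_1\sqrt{1 - r^2}}{2\pi\sigma_2}\,e^{-\frac{a'\sigma_1^2}{2}\left(\frac{x_1^2}{\sigma_1^2} - \frac{2rx_1x_2}{\sigma_1\sigma_2} + \frac{x_2^2}{\sigma_2^2}\right)}\,dx_1\,dx_2 \\
&\quad - \frac{a'r\sigma_1}{\sigma_2}\iint x_1x_2\,\frac{a'\sigma_1\sqrt{1 - r^2}}{2\pi\sigma_2}\,e^{-\frac{a'\sigma_1^2}{2}\left(\frac{x_1^2}{\sigma_1^2} - \frac{2rx_1x_2}{\sigma_1\sigma_2} + \frac{x_2^2}{\sigma_2^2}\right)}\,dx_1\,dx_2 \\
&\quad + \frac{a'\sigma_1^2}{2\sigma_2^2}\iint x_2^2\,\frac{a'\sigma_1\sqrt{1 - r^2}}{2\pi\sigma_2}\,e^{-\frac{a'\sigma_1^2}{2}\left(\frac{x_1^2}{\sigma_1^2} - \frac{2rx_1x_2}{\sigma_1\sigma_2} + \frac{x_2^2}{\sigma_2^2}\right)}\,dx_1\,dx_2 = 1
\end{aligned}$$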


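But the left side is equal to

$$\frac{a'}{2}\iint x_1^2\,dP(x_1x_2) - \frac{a'r\sigma_1}{\sigma_2}\iint x_1x_2\,dP(x_1x_2) + \frac{a'\sigma_1^2}{2\sigma_2^2}\iint x_2^2\,dP(x_1x_2)$$

which, according to Eqs. (iv) to (vi), is equal to

$$\frac{a'\sigma_1^2}{2} - \frac{a'r\sigma_1}{\sigma_2}\left(r\sigma_1\sigma_2\right) + \frac{a'\sigma_1^2}{2} = a'\sigma_1^2\left(1 - r^2\right)$$

Hence,

$$a' = \frac{1}{\sigma_1^2\left(1 - r^2\right)}$$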
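If this value of $a'$ is substituted in Eq. (20), it will give an equation in
which all the parameters have been evaluated in terms of the moments as
follows:

$$dP(x_1x_2) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1 - r^2}}\,e^{-\frac{1}{2(1 - r^2)}\left(\frac{x_1^2}{\sigma_1^2} - \frac{2rx_1x_2}{\sigma_1\sigma_2} + \frac{x_2^2}{\sigma_2^2}\right)}\,dx_1\,dx_2 \tag{22}$$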
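The final equation of the normal bivariate surface may be checked numerically against the moment conditions, Eqs. (i) and (vi), from which it was derived. The following Python sketch evaluates the height of the surface and sums it over a fine grid. The function names, the grid settings, and the sample values of the standard deviations and of r are hypothetical; the grid sum is only an approximation to the double integral.

```python
import math


def bivariate_normal_density(x1, x2, sigma_1, sigma_2, r):
    """Height of the normal bivariate surface at (x1, x2), where x1 and x2
    are deviations from their means."""
    one_minus_r2 = 1.0 - r * r
    q = ((x1 / sigma_1) ** 2
         - 2.0 * r * x1 * x2 / (sigma_1 * sigma_2)
         + (x2 / sigma_2) ** 2)
    k = 1.0 / (2.0 * math.pi * sigma_1 * sigma_2 * math.sqrt(one_minus_r2))
    return k * math.exp(-q / (2.0 * one_minus_r2))


def grid_check(sigma_1, sigma_2, r, half_width=8.0, steps=400):
    """Midpoint-rule approximations of the total probability, Eq. (i),
    and of the product moment, Eq. (vi)."""
    d1 = 2.0 * half_width * sigma_1 / steps
    d2 = 2.0 * half_width * sigma_2 / steps
    total = 0.0
    cross = 0.0
    for i in range(steps):
        x1 = -half_width * sigma_1 + (i + 0.5) * d1
        for j in range(steps):
            x2 = -half_width * sigma_2 + (j + 0.5) * d2
            p = bivariate_normal_density(x1, x2, sigma_1, sigma_2, r) * d1 * d2
            total += p
            cross += x1 * x2 * p
    return total, cross


# Hypothetical parameters: sigma_1 = 2, sigma_2 = 3, r = 0.6.
total, cross = grid_check(2.0, 3.0, 0.6)
print(round(total, 6), round(cross, 4), 0.6 * 2.0 * 3.0)
```

The first printed figure should be very close to one, and the second very close to the third, which is the product of r and the two standard deviations.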


PART V 
Study of Dynamic Variability 


CHAPTER XIX 
INDEX NUMBERS 


One of the most widely used statistical methods is the proce- 
dure that gives rise to the summarizing or expression of data in 
the form of index numbers. It is an application to a practical 
problem of simple principles of ready comparison, principles of 
averaging to obtain summary figures, and principles of stratified 
sampling. Today, the method of index numbers is applied in 
five large fields, as follows: 

1. The measurement of the general price level, or the measure- 
ment of general exchange value. 

2. The measurement of groups of prices, such as wholesale 
prices, retail prices, or wages. 

3. The measurement of the general quantity of production or 
trade with indexes of physical production, trade, or employment. 

4. The measurement of the general volume of business or 
trade with indexes of the value of production or trade, or with 
so-called “barometers.” 

5. Miscellaneous, including a wide variety of uses, some of 
which are given below on pages 511-513. 

History of Index Numbers. General use of the device known
as an “index number” to serve as a comprehensive method of
summarization is of recent origin. Like most of the modern
technique of statistics it has been developed since 1900. But
the fundamental idea is an old one. According to Warren and
Pearson, as early as 1738 Dutot made price comparisons showing
that a group of representative commodities cost twelve times as
much in 1735 as they did in 1508.¹ In 1764, an Italian, G. R.
Carli, attempted an investigation into the effect of the discovery
of America upon the purchasing power of money; he constructed
a very simple index number of prices, using only three commodities,
grain, wine, and oil. He combined the prices of these
three commodities in order to compare their average level in
1750 with the level of the same commodities in 1500.* The
gold movement from the New World to European countries
aroused speculation throughout the mercantile period with
respect to the relationship between prices and the amount of
money in circulation. Locke and Hume laid the groundwork
for the statement of what is now known as the quantity theory of
money. Speculations of the seventeenth and eighteenth centuries,
however, with the exception of Carli's unusual attempt,
were without the assistance of any measurement of the general
price level.

1 WARREN, GEORGE F., and FRANK A. PEARSON, Prices (1933), pp. 18-20,
containing other interesting examples of attempts to measure changes in
general price level prior to the middle of the nineteenth century.

Concern about the problem was brought to a new height during 
the Napoleonic Wars, when prices were fluctuating widely; and 
again during and following the Greenback era in the United 
States the question of the relationship between the general price 
level and the money supply became associated with inflationary 
issue of paper money. In the decade preceding the Civil War 
the discoveries of gold in California served to arouse interest in 
the question of the effect of increased supplies of gold upon the 
general price level. 

Twentieth-century economists, already interested in the
quantity theory of money by reason of the accumulation of these 
historical experiences, were provoked to continued and diligent 
study by the development of the South African gold mines since 
the 1890’s, accompanying world-wide general rising prices until 
the First World War. During the First World War and the 
subsequent period of maladjustment, with countries all over the 
world alternately on and off the gold standard, speculation in 
monetary theory became of such general interest that the prob- 
lem preoccupied some economists almost to the exclusion of other 
fields of study. 

* MITCHELL, WESLEY C., “Index Numbers of Wholesale Prices in the
United States and Foreign Countries,” Bureau of Labor Statistics, Bulletin
284, p. 7; cf. also reprint of Part I, “The Making and Using of Index Numbers,”
Bulletin 656 (1938), p. 7.

Meanwhile the statistical technique of measuring general price
change by the index-number device was developed; by 1798,
Sir George Schuckburg-Evelyn formulated a plan for making
index numbers of prices.¹ The efforts of the early statisticians in
this direction were accorded but scant approval by the economists,
who were apparently suspicious of “political arithmetic.”
Ricardo said that it is impossible to determine “the
value of a currency” by its “relation not to one, but to the mass
of commodities.”² Early in the nineteenth century mathematicians
were more interested in the application of the theory
of probabilities in the fields of astronomy, biology, anthropology,
and geology. The great exponents of the developing technique
in the application of statistical theory to the social sciences, such
as Quételet, were busy with problems in the realm of ethics and
morals; but about the middle of the nineteenth century came
powerful support for the application of these principles to economic
statistics.

William Stanley Jevons claimed that the works of Quételet 
abundantly proved that many subjects in the social sciences are 
so hopelessly intricate that they can be analyzed only by the 
use of averages and by trusting to probabilities as the form of 
generalization. He constructed indexes of wholesale prices in 
order to measure the value of gold and invoked the theory of 
probabilities as justification of his claim that the rise in prices 
was connected with the change in the value of gold, saying that 
“the odds are 10,000 to 1 against a series of disconnected and 
casual circumstances having caused the rise of prices—one in the 
case of one commodity, another in the case of another—instead 
of some general cause acting over them all.” The general cause 
acting over them all was considered to be the change in the
value of gold.³

1 “An Account of Some Endeavors to Ascertain a Standard of Weight
and Measure,” Philosophical Transactions of the Royal Society of London,
Part I, Art. viii, pp. 132-185; citation from Wesley C. Mitchell, Business
Cycles—The Problem and Its Setting (1928), p. 191.

2 Ibid., p. 193.

3 Ibid., p. 195.

In 1887 Prof. F. Y. Edgeworth began a series of contributions
to the problem of index numbers as a method of summarizing
trends in price statistics. He brought to bear upon the field of
the social sciences the mathematical theory of probability. He
saw clearly that it is a problem of applying a strictly a priori
theory to an analogous situation, but he insisted that the theory
of probability applied.¹ Later, the theoretical application of
probabilities to the problem of measuring social phenomena, and
particularly the general price level, was taken up by C. M.
Walsh, who published in 1901 a treatise on the measurement of
the price level and later published a book entitled The Problem
of Estimation, which further developed the application of probability
theory to economics.²

Since about 1915 the important problem of the technique of 
index-number construction has been attacked by a number of 
scholars. Prof. Wesley C. Mitchell was a pioneer in the explora- 
tion of the technical problems involved and a major part of their 
solution; others have done important work of this character 
during recent years, especially the economists and statisticians 
in government or semigovernment agencies, such as the Bureau 
of Labor Statistics and the Federal Reserve Board. 

Interpretation of the problems involved in the making of 
index numbers may be facilitated by analysis of two of the main 
principles involved: (1) the concepts of absolutes vs. relatives, 
and (2) the application of the theory of stratified sampling to the 
particular problem of the making of an index number. 

Conversion of Absolute Numbers to Relative Numbers. 
Absolutes. An absolute is an expression of the number of things 
being considered, measured by an appropriate unit, as 1,000 
bushels of wheat or 50 acres of land. A simple absolute taken 
by itself is of little importance. The number of people in a
country is of no particular significance unless a comparison is 
desired, for example, a comparison with the natural resources 
of the country or with the population at some other point in time 
or in some other country. 

1 PERSONS, W. M., “Statistics and Economic Theory,” Review of Economic
Statistics, Vol. 7 (1925), pp. 185-186. Also cited in Wesley C. Mitchell,
Business Cycles—The Problem and Its Setting (1928), p. 197.

2 The Measurement of General Exchange-Value, pp. 553-574, cited in
Wesley C. Mitchell, “Index Numbers of Wholesale Prices in the United
States and Foreign Countries,” Bureau of Labor Statistics, p. 9.

Prices are ordinarily conceived of as absolutes; that is, saying
that the price of wheat today is one dollar a bushel refers to the
objective thing, namely, the concrete one dollar. It is true that
this particular absolute has a ratio aspect when it is thought of
as a measure of the value of wheat. But when thought of merely
as one of the goods in an exchange, the dollar can rationally be
considered to be an absolute. Prices accordingly are referred
to as “absolutes.”

Relatives. In tabular form a ready visualization is often
accomplished by converting absolutes to relatives of some selected
base. For example, Table 63 shows data on three important
types of productive activity in the United States.


TABLE 63.—ESTIMATED VALUE OF SELECTED TYPES OF PRIVATE CONSTRUCTION
ACTIVITY IN THE UNITED STATES

                     New factory            Farm              New nonfarm resi-
                     construction        construction        dential construction
Years              Millions              Millions              Millions
                  of dollars  Index*    of dollars  Index*    of dollars  Index*

Annual average
  1926-1929           640      100         468       100        4,066      100
1932                   78       12         125        27          641       16
1933                  128       20         175        37          314        8
1937                  391       61         360        77        1,530       38
1938                  192       30         345        74        1,515       37
1939                  200       31         340        73        1,860       46
1940                  337       53         360        77        2,077       51

Source: Survey of Current Business, Vol. 21 (February, 1941), p. 21.
* Each index is on the base, average 1926-1929 = 100.


Considerable difficulty is encountered in obtaining a clear 
mental picture of the comparative changes in these three series 
by study of the absolutes themselves. Was the decline in new 
factory construction more severe in the 1932-1933 depression 
than the falling off in new residential construction? Did farm 
construction suffer more severely than new residential nonfarm 
construction? Such questions, involving comparative judg- 
ments, can be answered much more quickly if each series is 
converted into relatives or simple indexes upon a common base 
period. This is illustrated in Table 63, in the columns presenting 
the indexes with average 1926-1929 as the base. 
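The conversion of the absolutes of Table 63 to simple indexes on a common base may be sketched in Python as follows. The figures are those of the table as transcribed here; the function name, the dictionary layout, and the short labels given to the three series are assumptions made for illustration.

```python
def index_numbers(values: dict, base: float) -> dict:
    """Express each absolute as a whole-number percentage of the base."""
    return {year: round(100.0 * value / base) for year, value in values.items()}


# Millions of dollars, from Table 63; "base" is the 1926-1929 annual average.
construction = {
    "factory": {"base": 640, 1932: 78, 1933: 128, 1937: 391,
                1938: 192, 1939: 200, 1940: 337},
    "farm": {"base": 468, 1932: 125, 1933: 175, 1937: 360,
             1938: 345, 1939: 340, 1940: 360},
    "nonfarm_residential": {"base": 4066, 1932: 641, 1933: 314, 1937: 1530,
                            1938: 1515, 1939: 1860, 1940: 2077},
}

indexes = {
    name: index_numbers(series, series["base"])
    for name, series in construction.items()
}

# The lowest index of each series shows at a glance which type fell furthest.
for name, series in indexes.items():
    low_year = min((y for y in series if y != "base"), key=lambda y: series[y])
    print(name, low_year, series[low_year])
```

Once every series stands on the base of 100, the comparative questions raised above reduce to a comparison of the index columns.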

Simple index numbers, or relatives, of this sort involve the 
notion of comparing with unity. The mind more readily 
grasps expressions in round numbers than in odd numbers; it
further reduces mental effort if the round numbers are in multiples
of 10. From this fact arises the practice of relating prices
or other quantity figures or absolutes of any kind to each other
in such a way as to get a comparison based upon 1, 10, 100,
1,000, etc. If based upon 1, they are called “proportions”; if
based upon 100, they are called "percentages." They are all 
relatives, or indexes. Most commonly in the United States and 
in Great Britain and many other countries, 100 is used, although 
a few, notably Australia, use 1,000. 

Even where there is but one price series, it is simpler to com- 
prehend the significance of change if the absolute prices are 
converted to a relative form. For example, the changes in 
price of coffee per pound, as shown in Table 64, are easier to 
trace from period to period when expressed in relatives. Thus,
let 1926 be considered 100 and the prices in other years related
to it. The arithmetic involved is simple in principle and contains
two steps: (1) dividing the series throughout by the base selected,
which may more conveniently be done by multiplying throughout
by the reciprocal of the base and (2) multiplying by 100. This
method, illustrated in Table 64, makes the figure for the base
period equal to 100, and the rest fluctuate as percentages of the
base.


TABLE 64.—PRICE OF COFFEE
Annual averages in New York market of No. 7 Rio coffee
(In dollars per pound)

Item          Symbol           1926     1933     1934     1941
Price         p                0.182    0.078    0.098    0.080
Relative      (100/0.182)p     100      43       54       44

Source: Bureau of Labor Statistics, Wholesale Prices (June and December issues of
specified years).
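The two steps of the arithmetic may be sketched in Python, using the coffee prices of Table 64. The function name and the rounding to whole numbers are assumptions made for illustration.

```python
def price_relatives(prices: dict, base_year: int) -> dict:
    """Convert absolute prices to relatives with the base year equal to 100."""
    # Step 1: divide throughout by the base, done here by multiplying
    # throughout by the reciprocal of the base.
    reciprocal = 1.0 / prices[base_year]
    # Step 2: multiply by 100 (and round to the nearest whole number).
    return {year: round(price * reciprocal * 100.0)
            for year, price in prices.items()}


# Dollars per pound, from Table 64.
coffee = {1926: 0.182, 1933: 0.078, 1934: 0.098, 1941: 0.080}

print(price_relatives(coffee, base_year=1926))
```

Choosing a different base year changes every relative but leaves the ratio between any two of them unchanged.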

Another elementary idea is involved in the making of relatives,
and that is the reduction of nonhomogeneous sets of figures to a
homogeneous base for purposes of comparison and to simplify
interpretation of relative change among nonhomogeneous things.
For example, the prices of coffee per pound at different times,
the prices of canned peaches per dozen cans, and the prices of
wheat per bushel are all three presented in Table 65 for comparison
with each other.
It is difficult to compare the price of coffee per pound with the
price of wheat per bushel on the one hand and with the price of
canned peaches per dozen cans on the other, as they fluctuate
from time to time; but if all are changed to relative numbers,
by the method already described, with 1926 as a base period,
the comparison may easily be made. This is illustrated in
Table 66.
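The reduction of the three nonhomogeneous price series to a common base may be sketched in Python as follows. The prices are those of Table 65; the function name and the dictionary layout are assumptions made for illustration.

```python
def to_relatives(series: dict, base_year: int) -> dict:
    """Relatives of one series, with its own base-year figure equal to 100."""
    base = series[base_year]
    return {year: round(100.0 * value / base) for year, value in series.items()}


# Prices in their own units: dollars per pound, per dozen cans, per bushel.
prices = {
    "coffee": {1926: 0.182, 1933: 0.078, 1934: 0.098, 1941: 0.080},
    "canned peaches": {1926: 1.993, 1933: 1.146, 1934: 1.403, 1941: 1.528},
    "wheat": {1926: 1.496, 1933: 0.724, 1934: 0.932, 1941: 0.993},
}

relatives = {item: to_relatives(series, 1926) for item, series in prices.items()}

for item, series in relatives.items():
    print(item, [series[year] for year in (1926, 1933, 1934, 1941)])
```

Because each series is divided by its own base-year figure, the units of quotation cancel, and the three rows become directly comparable.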


TABLE 65.—PRICES OF COFFEE, CANNED PEACHES, AND WHEAT¹

Item                  1926     1933     1934     1941

Coffee                0.182    0.078    0.098    0.080
Canned peaches        1.993    1.146    1.403    1.528
Wheat                 1.496    0.724    0.932    0.993

Source: Bureau of Labor Statistics, Wholesale Prices.
1 Prices of canned peaches are annual averages, quoted in dollars per dozen cans; prices
of wheat are of No. 2 hard, Kansas City, quoted in dollars per bushel.


Relatives Using a Base Period in a Time Series. Price relatives,
and the relatives shown in Tables 65 and 66, illustrate relatives
using a base period in a time series. Three fundamental precautions
must be observed in the use of such relatives.


TABLE 66.—PRICE RELATIVES OF COFFEE, CANNED PEACHES, AND WHEAT
Average 1926 = 100

Item                  1926     1933     1934     1941
Coffee                100      43       54       44
Canned peaches        100      58       70       77
Wheat                 100      48       62       66


1. It is almost always advisable and sometimes it is necessary
to know the absolute figures as well as the relatives; else
misinterpretation or even misrepresentation is likely to result.
A classic example of a use of relatives that produced misinterpretation
and may perhaps have even been intended to be misrepresentation
was the evidence presented in 1932 by some notable
protectionist “statesmen” in the United States Congress.
Following are some of the statistics they issued for public
consumption:

TABLE 67.—SOME OF THE LARGE INCREASES IN IMPORTS DURING THE FIRST
8 MONTHS OF 1932
(In percentages)

                                                             Percentage
Commodity                                                increase in imports
Cod and other salt and pickled fish from Denmark ........     3,729.8
Salmon, fresh or frozen, from Japan .....................     2,511.8
Fish in airtight containers from Canada .................     4,669.9
Cheese from Denmark .....................................       136.3
Wrapping paper (other than kraft) from Sweden ...........       615.9
Pig iron from Sweden ....................................       181.0
Pig iron from the United Kingdom ........................   [?],011.8
Wool and other yarns from the United Kingdom ............       221.2
Long-staple cotton from Egypt and British India,
  but transshipped from the United Kingdom:
    Egypt ...............................................         3.1
    British India .......................................  [illegible]
From Canada, fresh pork .................................  [illegible]
Dried peas from New Zealand .............................       477.3
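A percentage increase is easily restated as a multiple of the earlier level, which makes plain how little a large percentage may signify when the starting figure is small. The following Python sketch does this for round percentages and for one figure from Table 67; the function name and the starting value of 1,000 dollars are hypothetical.

```python
def multiple_after_increase(percent_increase: float) -> float:
    """How many times the original amount remains after the stated increase."""
    return 1.0 + percent_increase / 100.0


# An increase of 1,000 per cent leaves eleven times the original amount,
# and 2,000 per cent leaves twenty-one times the original amount.
print(multiple_after_increase(1000.0))
print(multiple_after_increase(2000.0))

# The largest clean figure of Table 67, restated as a multiple.
print(round(multiple_after_increase(4669.9), 3))

# Hypothetical starting value: imports worth 1,000 dollars that rise by
# 1,000 per cent are still worth only 11,000 dollars.
print(1000.0 * multiple_after_increase(1000.0))
```

The multiple by itself says nothing about importance; that can be judged only against the absolute figures.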


The purpose of these statistics was to prove that a veritable
flood of foreign goods was threatening to inundate this country, 
put out of business all its domestic producers, and lower the 
wages of domestic workers. But the statistics are not what they 
seem to be. Some of the items were so very small in the aggre- 
gate in January, 1932 (the date they began to increase according 
to the table), that they were not even listed in the extensive 
classified list of imports that is published monthly by the Depart- 
ment of Commerce. If an exceedingly small item is increased by 
1,000 per cent, it is still small. Each time it increases 1,000 per 
cent, it is only eleven times as large as before; 2,000 per cent 
means twenty-one times as large. In January, 1932, the amount 
of pig iron imported into the United States from Sweden and the 
United Kingdom combined was less than 460 tons, worth about 
$4,500, which, compared with United States domestic production,
was a mere nothing. The imports in January, 1932, of
wrapping paper other than kraft from Sweden amounted to 
$2,025. The imports in the same month of Egyptian cotton, 
transshipped from the United Kingdom, amounted to $982. 
The last item is particularly enlightening; it will be noted that 
the specification “transshipped from the United Kingdom” is 
carefully made. Most cotton of this type comes to the United 
States directly from Egypt and is not essentially competitive 
with American-grown cotton. 
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2, The meaning of a percentage figure is often ambiguous, and 
study of its background is necessary before it can be properly 
derstood. An illustration of the misinterpretation of a per- 
centage figure can again be found in the arguments of American 
protectionists. When it is alleged that our tariffs are already 
too high, the protectionists like to reply. that they are not too 
high. To prove their statement they point to the fact that a 
large percentage of the imports are on the “free list,” that is, 
that a large proportion of imports into the United States are 
charged no duty at all. This argument sounds plausible, but
its non sequitur quality becomes evident when it is realized that
the tariffs on dutiable goods are so high that such goods are virtually
excluded from entering; it is thus the virtual exclusion of certain
dutiable imports that causes a large proportion of imports to
be goods on the free list. If the entire 100 per cent of imports 
were on the free list, it would mean, not that the tariff was not 
high, but that the tariff was so high that none of the dutiable 
goods could come in. 
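
The point can be made concrete with a small sketch. The import values below are hypothetical and are not taken from any table in the text; the function name is likewise an assumption. Free-list imports are held constant while dutiable imports are progressively shut out.

```python
def free_list_share(free_imports: float, dutiable_imports: float) -> float:
    """Percentage of total imports that enter free of duty."""
    total = free_imports + dutiable_imports
    return 100 * free_imports / total


# Hypothetical values in any unit of currency; free-list imports never change.
free = 600.0
for dutiable in (400.0, 200.0, 50.0, 0.0):
    share = free_list_share(free, dutiable)
    print(f"dutiable imports {dutiable:6.1f}: free list is {share:5.1f} per cent of imports")
```

The share on the free list rises toward 100 per cent only because the dutiable goods disappear, which is the reverse of what the protectionist argument infers.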

3. In a series of coordinate relatives, it is necessary to know 
the base and to specify it for the information of others. For
example, death-rate figures are quite meaningless unless the 
comparison is known. The death rate may be expressed as so 
many deaths per 1,000 people or per 100 people, and the statisti- 
cian should indicate which comparison is made. Death rates 
for a given disease may be expressed as the number of deaths 
per 1,000 people afflicted by the disease rather than as the number 
of deaths per 1,000 people whether exposed to it or not; again, the 
nature of the comparison should be specified. 

In simple index numbers like those given as examples in 
Tables 64 and 66, it is essential to know that the base is 1926 
or the average of 1926-1929, as the case may be. This should 
always be indicated somewhere in the title or subcaptions of the 
table or in a footnote.

Presumption of Normality in the Selected Base. When a series 
of coordinate relatives is constructed by relating a series of 
absolutes to some selected base, the base tends to be regarded 
as the normal level. Indexes in the series greater than 100 are 
looked upon as above normal, or above par, and indexes less 
than 100 are looked upon as below normal. Since this tendency 
exists, it is always desirable to give study to the matter of 
selecting the base. Is it in fact the one that is at normal level,
or is one of the other absolutes of the series at normal level? 

For example, in the illustration given in Table 63, should the 
annual average of new factory construction, 1926-1929, be 
regarded as normal? Was there, on the average, a normal 
amount of new nonfarm residential construction in those years? 
It might reasonably be argued that taking 3 years as a base is 
better than taking only 1 year, because an average of 3 years 
might tend to offset extreme fluctuations and produce compari- 
sons that would tend to be better than if only 1 year were used 
as a base. Thus the average of the 3 years might be about 
normal for each of the three types of construction compared, 
whereas if only 1 year were taken one or the other of the three 
types might have had an exceptionally high or low year. 

On the other hand, it may be pointed out that the years 
1926-1929 covered a range of years in which a great construction 
boom reached its peak. Consequently, all types of construction 
were above normal in all three of those years; some writers claim 
this was the peak of the greatest and longest construction boom 
in history. Construction was at a high level such as it might 
not be expected to reach again for many years, at least if the 
length of construction booms is some seventeen years from peak 
to peak, as some say it is. It may therefore be argued that 
1937 would be a better base to take, even if only 1 year is used. 
In that year the general level of economic activity seemed to 
be nearer to a normal or equilibrium than any other year in 
recent history, and certainly nearer normal than the boom year 
of 1929. But the year 1937 would be a poor base year for strike 
statistics because of the great disturbances in the coal industry 
in that year. 

Selection of the base has an important effect upon subsequent 
judgments as to the trends of the three series. If the average 
1926-1929 is taken as the base, all three of the construction 
activities were still below normal in the year 1940, as the indexes 
in Table 63 show; but if 1937 is taken as the base, the 1940 level 
of new factory construction would be 86, the 1940 level of farm 
construction would be 100, and the level of new nonfarm residen- 
tial construction would be 136. If 1937 is considered normal, 
in the years 1926-1929 new factory construction averaged 63 
per cent above normal, farm construction averaged 30 per cent 
above normal, and new nonfarm residential construction was
166 per cent above normal. 

The data shown in Table 64 and the indexes presented in 
Tables 65 and 68 may also be used to illustrate the effect of the 
base selected. In Table 66 and Fig. 133, the year 1926 is the 
base and the prices of coffee, canned peaches, and wheat are 
each set at 100 in that year; subsequent years are indexed 
accordingly. From 1926 to 1933 the greatest decline occurred 
in the price of coffee, the next greatest occurred in the price of 
canned peaches, and the decline in the price of wheat was comparatively
the least. Their relative recovery was in the same
order, and all three were below normal in 1941.

Fig. 133.—Indexes of prices of wheat, canned peaches, and coffee. 1926 = 100.

But if it is considered that they were at normal levels in 1941 
so that all are called 100 for that year, a quite different picture 
is obtained, as shown in Table 68 and Fig. 134. If these prices
were normally related to each other in 1941, then in 1926 the
price of coffee was more than twice normal, the price of wheat
was well above 50 per cent over normal, and the price of canned
peaches was 30 per cent over normal. Moreover, Fig. 134 and
Table 68 seem to indicate that it was the price of canned peaches
that was farthest below normal in 1933; the price of coffee was
only slightly below normal.


TABLE 68.—PRICE RELATIVES OF COFFEE, CANNED PEACHES, AND WHEAT
1941 = 100

Item               1926    1933    1934    1941
Coffee              228      98     122     100
Canned peaches      130      75      92     100
Wheat               151      73      94     100
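
Shifting the base of a series of relatives is a matter of dividing every figure by the value for the new base period. The sketch below uses the 1926, 1933, and 1941 figures printed in Table 68 and converts them back to a 1926 base. The function name is hypothetical, and the assignment of the three rows to coffee, canned peaches, and wheat is an assumption drawn from the table title and the discussion above.

```python
def rebase(series: dict[int, float], new_base_year: int) -> dict[int, float]:
    """Re-express a series of relatives so that new_base_year equals 100."""
    base_value = series[new_base_year]
    return {year: 100 * value / base_value for year, value in series.items()}


# Relatives from Table 68 (1941 = 100); row labels are assumed, see above.
table_68 = {
    "coffee": {1926: 228, 1933: 98, 1941: 100},
    "canned peaches": {1926: 130, 1933: 75, 1941: 100},
    "wheat": {1926: 151, 1933: 73, 1941: 100},
}

on_1926_base = {item: rebase(series, 1926) for item, series in table_68.items()}

for item, series in on_1926_base.items():
    print(item, {year: round(value) for year, value in series.items()})
```

On the 1926 base the 1941 relatives for canned peaches and wheat come out at about 77 and 66, which agrees with the figures printed for those commodities in Table 71. The ratios between years within each series are unchanged by the shift; only the year called 100 differs.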
Fig. 134.—Indexes of prices of coffee, canned peaches, and wheat. 1941 = 100.


For most comparisons, a year too remote in the past is not a 
desirable base. For a long time, 1913, or an average of the 
years 1909-1914, was looked upon as the best base period to use, 
because it was the last normal period before the First World 
War. The farm bloc in Congress continued as late as 1941 to 
insist that farm prices should be permitted to rise to the par 
that existed before the First World War; but in 1941-1942, as 
farm prices began to rise at a more rapid rate than other prices 
so that they passed parity, the farm bloc began to insist upon a 
new definition of parity. The long survival of 1909-1914 as a
base illustrates, not the general desirability of having a remote
base period, but merely the power of the farm bloc. Ordinarily,
general economic change over a 28-year period is sufficiently
great to make such a base undesirable. 

In the 1920’s, accordingly, most comparisons came to be made, 
not with prewar 1913, but with the average of 1923-1925 or with 
the single year 1926; these years persisted as a base period much 
longer than might ordinarily be expected because the extreme 
decline of the depression of the early 1930’s made it difficult to 
select a new base period. Finally, however, as the years of the 
Second World War passed, the period immediately preceding it 
came to be regarded as the best base for current comparisons. 
In the early 1940’s the average for the years 1935-1939, or one
of those years, began to be adopted as the base period.¹

1 In the Survey of Current Business, Vol. 22 (November, 1942), the index
of prices received by farmers was still reported on the base of the average of
1909-1914 prices, the index of wholesale prices was still based on the 1926
average, and the index of retail prices was based on the average for
1923-1925; but the cost-of-living index was based on the average 1935-1939,
and the indexes of the purchasing power of the dollar (wholesale, retail,
and farm) were based upon the 1935-1939 average. The indexes of national
income and industrial production were on the average 1935-1939 base.
The indexes of some manufacturing data, such as orders, shipments, and
inventories, were based upon the average for the single year 1939. The
Survey of Current Business, Vol. 22 (December, 1942), published the Bureau
of Labor Statistics indexes of wage-earner employment and weekly wages in
manufacturing industries, revised, with the average of the year 1939 as
the base.

Relative Parts of a Whole. A single absolute quantity is often
divided into several parts, and these several parts are expressed
as percentages or proportions of the whole. These are properly
called, not “index numbers,” but simply “relatives,” although
they could be referred to as “constituent relatives.” The term
index numbers, used with strict propriety, refers to a series of
relatives that is a composite of a more or less large number of
series of relative numbers. The series of relatives may be combined
to form a series of index numbers by any one of a number
of methods of aggregating or averaging, as will be explained
later in this chapter. Accordingly, in strict usage when an
index is an average of relatives, the term relative should be
reserved for the separate ingredients and the term index should
be used for the composite. Yet this distinction is often honored
in the breach as well as in the observance, and the student must
expect to find the term index used in place of relative.

An important point to remember in the use of constituent
relatives is that a relative increase or decrease does not necessarily
mean an absolute increase or decrease in the subgroup.
The absolute of the subgroup may, indeed, move in the opposite 
direction from that indicated by the relative figures. Con- 
stituent relatives are useful when it is required to see clearly 
the relative changes. If absolute changes are desired, the raw 
data must be examined. Table 69 is an example of the use of 
constituent relatives. It reveals the necessity of attention to the 
absolute as well as the relative figures. 


TABLE 69.—DEATH RATES PER 100,000 POLICYHOLDERS FROM SELECTED CAUSES
Weekly premium-paying industrial business, Metropolitan Life Insurance
Company

                                Annual rate per         Percentage distribution
                                   100,000*               of specified causes
Specified causes of death
                               1940    1941    1942      1940     1941     1942

Total                             …       …   501.6    100.00   100.00   100.00
Diabetes mellitus                 …       …    30.2      5.85     6.10     6.02
Appendicitis                    8.6     7.6     5.4      1.62     1.37     1.08
Influenza and pneumonia        74.5    79.5    47.2     14.01    14.35     9.41
Tuberculosis (all forms)          …    44.0    41.9      8.45     7.94     8.35
Syphilis                          …       …    10.0      2.33     1.98     1.99
Cancer (all forms)                …       …   102.2     19.21    18.74    20.37
Diseases of the heart             …       …   236.7     43.89    44.39    47.19
Motor-vehicle accidents        17.2    20.2    21.3      3.24     3.65     4.25
Suicide                         7.5     8.1     6.7      1.41     1.46     1.34

Source: Metropolitan Life Insurance Company, Statistical Bulletin, March, 1942, p. 11.
* Per 100,000 policyholders, based upon first 3 months of each year.


In order to illustrate the necessity of presenting the absolute 
figures as well as relative figures when constituent relatives are 
used, Table 70 is drawn up with a few items taken from Table 69. 
Study of the percentage distribution shown in Table 70 would 
appear to indicate that the death rate from suicides increased 
between the years 1940 and 1942. Actually, it decreased. 
Merely its relative position became more important. The percentage
distribution, in the absence of attention to the absolute
figures, would also lead to a tendency to exaggerate the rise in
the death rate from automobile accidents. These misleading
results are due to the change in the size of the totals for the
respective years considered: from 107.8 in 1940 to 115.4 in
1941 and to 80.8 in 1942.


TABLE 70.—DEATH RATE PER 100,000 POLICYHOLDERS FROM SELECTED CAUSES
Weekly premium-paying industrial business, Metropolitan Life Insurance
Company

                                                        Percentage distribution
                                  Annual rate*            of specified causes
Specified causes of death
                               1940    1941    1942      1940     1941     1942

Total                         107.8   115.4    80.8    100.00   100.00   100.00
Influenza and pneumonia        74.5    79.5    47.2     69.11    68.89    58.56
Appendicitis                    8.6     7.6     5.4      7.97     6.59     6.70
Suicide                         7.5     8.1     6.7      6.96     7.02     8.31
Motor-vehicle accidents        17.2    20.2    21.3     15.96    17.50    26.43

* Per 100,000 policyholders, based upon first 3 months of each year.


Especially when a small number of rates are being considered, 
as in Table 70, it is necessary to study both the rates and the per- 
centage distribution. Actually, the study of rates is required to 
answer the question: Is the rate from suicides greater in 1942 
than in 1941? Study of the percentage distribution of specified 
causes is required to answer the questions: In 1942 were motor- 
vehicle accidents a more important cause of death than influenza 
and pneumonia combined? Did motor-vehicle accidents become 
relatively more important from 1940 to 1942 as compared with 
the other specified causes? Important questions are answered 
by each of the sets of figures; what is necessary to avoid is the 
use of the wrong set of figures to answer a given question. 
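
The distinction between the two sets of figures can be verified directly from the rates. The sketch below recomputes the percentage distribution for 1940 and 1942 from the annual rates of the four causes in Table 70. The function and variable names are hypothetical, and the 1942 motor-vehicle rate is taken as 21.3, the figure legible in the 1942 column of Table 69.

```python
# Annual death rates per 100,000 policyholders for the four causes of Table 70.
rates = {
    1940: {"influenza and pneumonia": 74.5, "appendicitis": 8.6,
           "suicide": 7.5, "motor-vehicle accidents": 17.2},
    1942: {"influenza and pneumonia": 47.2, "appendicitis": 5.4,
           "suicide": 6.7, "motor-vehicle accidents": 21.3},
}


def percentage_distribution(year_rates: dict[str, float]) -> dict[str, float]:
    """Express each rate as a percentage of the total of the listed rates."""
    total = sum(year_rates.values())
    return {cause: 100 * rate / total for cause, rate in year_rates.items()}


shares = {year: percentage_distribution(year_rates) for year, year_rates in rates.items()}

for cause in rates[1940]:
    print(f"{cause:26s} rate {rates[1940][cause]:5.1f} -> {rates[1942][cause]:5.1f}   "
          f"share {shares[1940][cause]:5.2f} -> {shares[1942][cause]:5.2f}")
```

The suicide rate falls from 7.5 to 6.7 while its share of the four causes rises, because the total of the four rates shrank more than the suicide rate did. The rates answer questions about the level of mortality; the shares answer questions about relative importance among the causes listed.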

Great Variety of Simple Index Numbers in Use. Hundreds of 
simple index numbers are in use, and the number has been 
increasing rapidly since the First World War. Indexes of the 
simple type illustrated in Tables 63, 65, and 68 exist for nearly 
every separate industrial activity, for thousands of prices, for 
retail sales, wholesale sales, inventories, consumption of certain 
types of goods, and for many other things related to economic 
and social activity. Index numbers measuring production
from month to month in a large list of industries have been com- 
piled and published by the Board of Governors of the Federal 
Reserve System and other agencies. Indexes of marketing of 
fish, dairy products, livestock, wool, and poultry and eggs have 
been compiled by the Bureau of Foreign and Domestic Commerce; 
and indexes of the marketing of cotton, fruits, grains and vege- 
tables, lumber, and other natural products are compiled by the 
same bureau. This bureau has also compiled and published a 
large number of simple relative figures for new orders and unfilled 
orders in a number of manufacturing industries, including iron 
and steel, paper, lumber, textiles; and another series of index 
numbers of commodity stocks of manufactured goods and of raw 
materials, such as chemicals, foodstuffs, metals, textile materials, 
and rubber products. These indexes are published currently in 
the Survey of Current Business by the United States Department
of Commerce. In the League of Nations publications, indexes of 
world stocks of foodstuffs and certain raw materials are available. 

The United States Department of Commerce has recently 
begun the compilation and publication of indexes of transpor- 
tation for the United States. These monthly indexes include a 
combined index of all types of transportation, commodity and 
passenger, and also indexes by types of transportation, such as 
an index of air transportation and a combined index of intercity 
motorbus and truck transportation. The indexes are published 
monthly with the base period 1935-1939 = 100 and appear in 
the Survey of Current Business. This publication contains 
other illustrations of the many uses of index numbers. 

The use of either subordinate or coordinate relatives to aid 
in the interpretation of series of data does not involve the appli- 
cation of the theory of statistics or the principles of sampling, 
although the gathering of the raw data may have involved the 
use of the latter. The rules of comparability must be considered 
when numbers are converted into relatives, however, as indicated
in the discussion above. When a whole series of these
simple index numbers, or relatives, are combined into a composite
index number, it is necessary to make application of the
theory of statistics. The principles of stratified sampling apply
to the construction of these composite index numbers.

1 The transportation indexes are described in the Survey of Current
Business, Vol. 22 (September, 1942), pp. 20-28; Vol. 23 (May, 1943), pp.
26-27.

Index Numbers. Application of Sampling Technique. Index 
numbers are combinations of a large number of single series of 
relatives by some method of aggregating or averaging. In the 
field of prices, indexes of farm prices, of cost of living, of retail 
prices, of wholesale prices, of wages, and of exchange rates are
some of the index numbers obtainable. Also, indexes of industrial
production, of trade activity, of retail trade, and of
employment are found in various sources. All these indexes are 
combinations of numerous series of relatives. 

From consideration of the various purposes for which index 
numbers may be used, it should immediately be apparent that a 
difficulty is involved. How, for example, is it possible to get 
together all the facts in the United States regarding all wholesale
prices from time to time, or all wages, or all retail prices, or all
production or consumption activities? The answer, of course,
is that it is not possible, or certainly not feasible, but that a
sample of some kind must be used. When a composite is made
up of several series, how shall they be weighted? Should they
be considered of equal importance, and if not how shall their
relative importance be determined in making up the composite?
It is upon the basis of the principles of sampling that such
index numbers are justified. As Prof. Edgeworth once said,
the task is to extricate from fallible observations a mean apt
to represent the general trend of prices, wages, production, or
whatever is being measured.¹

The demonstration by eighteenth- and nineteenth-century 
statisticians, such as Süssmilch and Quételet, that a hitherto
unsuspected regularity lay hidden in numerical data about
social phenomena encouraged economists and social scientists
in the belief that known variations that had been measured might
be fair samples of the more numerous unknown variations.
Furthermore, the construction of a great variety of composite
index numbers by different investigators using different methods
has produced results of such consistency as to inspire confidence
in their use.²


1 Cf. MITCHELL, WESLEY C., Business Cycles: The Problem and Its Setting
(1928), p. 204.

2 Cf. MITCHELL, WESLEY C., “Index Numbers of Wholesale Prices in the
United States and Foreign Countries,” Bureau of Labor Statistics, Bulletin
284, p. 11.

To grasp the significance of an index number, it is not sufficient
to have reference only to the summary picture it presents. Just
as in the case of an average of one frequency distribution, so in
the case of index numbers, the distribution of cases is of great
importance. An index number is really a series of averages
based upon a series of frequency distributions, one frequency
distribution for each time period, of which the index number
itself is an average of some sort. A study based upon this idea
was made by Wesley C. Mitchell in his analysis of year-to-year
fluctuations of the prices recorded in the wholesale price bulletins
of the Bureau of Labor Statistics, covering prices from
1891 to 1918 and including 232 to 348 commodities. He found
that the price changes from year to year formed a fairly symmetrical
frequency distribution each year, and hence he concluded
that “when it can be shown that phenomena are
distributed approximately in this fashion, their average can safely
be accepted as a significant measure of the whole set of variations,
since even the deviations from the average are then grouped in a
tolerably definite and symmetrical fashion about the average.”¹

Such an analysis seemed to establish as satisfactory the use
of an average to summarize price change from year to year; but
index numbers frequently extend over a considerable period of
time so that the general level of wholesale prices of 1942, for
example, is compared with 1926 as a base or with 1935-1939.
Year-to-year fluctuations may occur in a manner such that the
average may be used to summarize; but what of change compared
with some year more remote in the past? In order to
test the reliability of the method of index-number construction in
this regard, Prof. Wesley C. Mitchell applied the technique of
taking several samples; in one sample he took 242 commodities,
in another 50 commodities, and in a third sample 25 commodities
at wholesale prices and constructed three sample index numbers
for the period 1890-1913. He found that the results from the
smaller samples were strikingly close to those of the larger
sample.²

1 Ibid., pp. 17-18.

2 Ibid., p. 38. The theory of sampling errors does not apply in a way
that makes possible mathematical tests from a single sample.
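
The kind of comparison described above, an index from 242 commodities set beside indexes from 50 and from 25, can be imitated with a short simulation. This is only a sketch under stated assumptions: the price relatives are synthetic (drawn from a lognormal distribution with parameters chosen arbitrarily), the index is a simple arithmetic mean of relatives, and the smaller samples are drawn at random. None of this reproduces Mitchell's data or his actual method of selection.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Synthetic price relatives (base period = 100) for 242 commodities.
# These are simulated values, NOT the Bureau of Labor Statistics data.
universe = [100 * random.lognormvariate(0.0, 0.2) for _ in range(242)]


def simple_index(relatives: list[float]) -> float:
    """Unweighted arithmetic mean of a list of price relatives."""
    return statistics.fmean(relatives)


full = simple_index(universe)

# Index numbers from smaller random samples of the same commodities.
sample_indexes = {
    size: simple_index(random.sample(universe, size)) for size in (50, 25)
}

print(f"242 commodities: {full:.1f}")
for size, value in sample_indexes.items():
    print(f"{size:3d} commodities: {value:.1f} (difference {value - full:+.1f})")
```

Rerunning with other seeds gives a feel for how far a small sample may stray from the large one when the relatives are distributed fairly symmetrically about their average.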
Stratified Sampling Method Applied. Others have also found
that whenever the principles of stratified sampling have been 
followed in the construction of index numbers of wholesale 
prices, the results obtained are similar to the results obtained by 
the use of all available data. This inspired confidence in index 
numbers extending back through the years, for which fewer price 
series are available in published records, and at the same time 
increased the belief that such an average expression of prices in
the form of index numbers is a valid summary picture of general 
price change. 

It is upon the basis of the principle of stratified sampling that 
it is possible to measure by index numbers such things as the
cost of living, or the volume of production, or the general wholesale
price level. Also, it is upon the basis of the theory of
sampling that credence can be given to index numbers; in
addition, it is due to this very fact that it is necessary to examine 
the constituent parts of an index number to be sure that it 
measures what it purports to measure and that it is applicable to 
any particular problem for which it is desired to use an index 
number. 

It is necessary to notice that stratified sampling is applied to 
the making of index numbers. For example, take the problem 
of measuring general price movement. This is not a case in 
which there is an infinite number of items, although the universe 
is a very large number, and the number for which data are given 
is probably less than the number for which data are unavailable, 
particularly in the case of retail prices or wages. In the case of 
wholesale prices the available data cover a larger portion of the 
universe. 

Not only is there not an infinite number of items, but the 
number of available items is often not a very large one. For 
example, some index numbers are based upon fewer than 50
individual index-number series. However, the universe from
which the items are taken is one concerning which a priori
knowledge exists. According to such a priori knowledge, a
representative sample can be obtained by a conscious or deliberate
proportional selection of items from the various known
strata of the universe. For example, it is known, in the case of
wholesale prices, that the universe is made up of prices of foods,
prices of metals, prices of forest products, prices of various raw
materials, prices of semimanufactured products in a number of
fields that can be classified, and prices of final goods at wholesale,
to enumerate a few of the known strata of this universe.
Knowing that such strata exist in the universe, the sample can
be made proportional by a deliberate stratified random sampling
procedure that would ensure proper representation in the sample
of all the various strata known to exist in the universe.¹
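
A proportional stratified selection of this kind may be sketched as follows. The strata names follow the examples in the text, but the number of price series in each stratum and the sample size of 50 are hypothetical, as are the function and variable names. Rounding is handled by the largest-remainder rule, which is one reasonable choice among several.

```python
import random


def proportional_allocation(stratum_sizes: dict[str, int], sample_size: int) -> dict[str, int]:
    """Allocate sample_size across strata in proportion to the size of each stratum."""
    total = sum(stratum_sizes.values())
    exact = {name: sample_size * n / total for name, n in stratum_sizes.items()}
    allocation = {name: int(x) for name, x in exact.items()}
    leftover = sample_size - sum(allocation.values())
    # Give any remaining units to the strata with the largest fractional parts.
    by_remainder = sorted(exact, key=lambda k: exact[k] - allocation[k], reverse=True)
    for name in by_remainder[:leftover]:
        allocation[name] += 1
    return allocation


# Hypothetical counts of available price series in each known stratum.
strata = {
    "foods": 120, "metals": 60, "forest products": 40,
    "raw materials": 80, "semimanufactures": 50, "final goods": 150,
}

allocation = proportional_allocation(strata, 50)

random.seed(1)
# Within each stratum, pick the allotted number of series at random (by position).
chosen = {name: sorted(random.sample(range(strata[name]), k))
          for name, k in allocation.items()}

for name, k in allocation.items():
    print(f"{name:18s} {strata[name]:4d} series available, {k:2d} selected")
```

Each stratum contributes to the sample in the same proportion in which it appears in the universe, which is what makes the sample representative by construction rather than by chance.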

Variety of Purposes of Index Numbers. As a historical propo- 
sition, the original all-pervading purpose of an index number 
was to measure general exchange value, that is to say, to explain 
the relationship between prices, in their general or average move- 
ment, and the value of money and credit. 

At the present time, however, a large number of general 
indexes of prices and other phenomena are currently published; 
but few even of the general price indexes purport to be a measure 
of the value of money. General indexes of retail prices, indexes 
of wages and pay-roll totals, indexes of prices of farm products,
metal products, manufactured goods, and raw materials, as 
well as general wholesale prices, are now available. Which of 
these price indexes really measures the value of money? 

1 Cf. KING, W. I., Index Numbers Elucidated, especially pp. 64-66. This,
of course, often turns out to be a counsel of perfection in practice. The
principle is based upon the assumption that in each of the strata designated
the available data can be sampled successfully at random; and in practice
this is often not true. For illustration, in gathering prices for an index of
wholesale prices such subgroups of prices, or strata, as sulphuric acid and
Portland cement are standardized, while house furnishings are not. From
the point of view of obtaining the best possible results with the minimum
amount of price gathering, and presumably with limited funds for the
purpose, it would be sound practice to abandon the counsel of perfection
and spend less money gathering prices of standardized articles and more
gathering prices of nonstandardized articles. The resulting disproportionate
amount of prices in the respective subgroups can then be countered by the
required adjustments in the weights used to combine the series of relatives
into index numbers.

Some statisticians and economists have held that a real
measure of the changes in the value of money and credit should
include, not only wholesale prices, but also wages, rent, and
other prices, including retail prices and perhaps the prices of
securities. Samples of each kind of price should be included in
the index of prices that aims to measure general exchange value.
On this theory, Carl Snyder, at that time statistician for the
Federal Reserve Bank of New York, compiled an “index of
general price level,” including wholesale and retail prices, wages, 
rents, etc., but for certain reasons he excluded security prices. 
After careful study, these various components were given certain 
weights in the general composite. It should be pointed out that 
this index of general price level was originated for the special 
purpose of deflating data on bank clearings. Since bank clear- 
ings included payments for all these things, Snyder believed that 
an index of prices based upon these components could be used to 
cancel out that part of change in total bank clearings due to 
price change and obtain thereby an index of physical volume of 
trade. Even if it is granted that this index of general price level 
is valid as a deflator of bank clearings, it still remains a question 
whether or not it measures the exchange value of money. 
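
The deflating operation described above amounts to dividing an index of dollar value by an index of prices. The sketch below assumes both indexes are on the same base period; the figures are hypothetical and are not Snyder's, and the function name is an assumption.

```python
def deflate(value_index: float, price_index: float) -> float:
    """Physical-volume index implied by a value index and a price index (base = 100)."""
    return 100 * value_index / price_index


# Hypothetical index numbers on a common base period (base = 100).
bank_clearings = {"base period": 100.0, "later period": 132.0}
general_price_level = {"base period": 100.0, "later period": 110.0}

physical_volume = {
    period: deflate(bank_clearings[period], general_price_level[period])
    for period in bank_clearings
}

for period, volume in physical_volume.items():
    print(f"{period}: physical volume of trade = {volume:.1f}")
```

With clearings up 32 per cent and prices up 10 per cent, the implied physical volume is up 20 per cent, not 22 per cent, since the two changes combine by multiplication rather than by addition.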

It could be argued with considerable force that such a general 
measure is impracticable because of the difficulty of getting 
adequate samples of rents, for example. And in any case, such 
a general measure of prices does not really give the measure of 
change in the purchasing power of money. The general purchasing
power of money may be a far more flexible and possibly
sensitive factor than this general price average would indicate.
A general price average would include an overweight of prices
largely controlled by custom, or of prices in which resistance to
change is very great for some other reason, as, for example,
because of public regulation, taxation, or their indirect effects.
The true measure of change in general exchange value may 
be more nearly approximated by the wholesale price index and 
perhaps even by the group of more sensitive wholesale prices. 

It is not the purpose here to carry this argument to a conclusion 
but merely to suggest its unsettled state. It may be significant 
that the Bureau of Labor Statistics has published reciprocals of 
its several indexes of prices (wholesale, retail, cost-of-living, and
farm products) as indexes of the purchasing power of the dollar
in those respective fields. The question of how to measure 
general exchange value, or the purchasing power of money, con- 
tinues to be a controversial one. Meanwhile, index numbers 
continue to serve enormously useful special purposes whether 
or not collectively or individually they measure general exchange 
value. In his Treatise on Money, J. M. Keynes appears to 
suggest that the exchange should be looked upon as a number of 
relatively noncompeting groups of markets and that there may
be no such thing as a general purchasing power of money.¹
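
The reciprocal of a price index, expressed again on a base of 100, is what is meant by an index of the purchasing power of the dollar in a given field. A minimal sketch with hypothetical index values and a hypothetical function name:

```python
def purchasing_power(price_index: float) -> float:
    """Reciprocal of a price index, rescaled so that the base period equals 100."""
    return 100 * 100 / price_index


# Hypothetical price-index values (base period = 100).
for price_index in (80.0, 100.0, 125.0):
    print(f"price index {price_index:5.1f} -> purchasing power {purchasing_power(price_index):5.1f}")
```

A rise of 25 per cent in prices lowers the purchasing power of the dollar by 20 per cent, and a fall of 20 per cent in prices raises it by 25 per cent. Each price index yields its own purchasing-power index, which is why the question of a single general purchasing power remains open.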

In the light of such theoretical difficulties hindering the 
proper measurement of the purchasing power of money by using 
reciprocals of price indexes, recent attempts have been made to 
construct indexes of the purchasing power of money by other 
means. One notable contribution is the index of purchasing 
power constructed by Murray Shields; this combines monetary 
data, viz., demand deposits, foreign deposits in the United 
States, foreign bank deposits in Federal reserve banks, volume of 
money in circulation, and cash in the vaults of commercial 
banks.²

Construction of Index Numbers. Principal Methods. Prof. 
Irving Fisher of Yale University, in a comprehensive study 
of the mathematics of index-number making, found several 
hundred kinds of formulas for calculating index numbers; but 
it is quite unnecessary to be disturbed by this fact, since as 
he himself says, only a few of them are of any value. There are 
two principal methods of calculating index numbers now in use 
and generally recognized as adequate for most purposes, but 
other methods are occasionally used and will therefore be 
described. The most commonly used are (1) the weighted 
average-of-relatives method and (2) the weighted aggregative 
method. Other methods sometimes used are (3) the simple 
average-of-relatives method and (4) the simple aggregative 
method. Various alternative ways of applying these methods 
are possible. For example, in the case of the simple average of 
relatives, sometimes the median is used instead of the arith- 
metical mean in order to avoid extreme variations; it is advisable 
to use the median especially for very small samples. These
methods will be taken up in the order (3), (1), (4), and (2), 
which is the logical method of treating them, rather than in the 
order of their prevalence in use, which is that given above. 

1 Cf. also BECKHART, B. H., The New York Money Market, Vol. 2, and
KING, op. cit., pp. 189-216.

2 “Measure of Purchasing Power Inflation and Deflation,” Journal of
the American Statistical Association, Vol. 35 (1940), pp. 461-470.

Simple Average-of-relatives Method. Referring again to the
simple case of the prices of coffee, wheat, and canned peaches,
already used, perhaps the first method that would suggest itself
to anyone desiring to obtain a summary figure representing
average price change would be to add up the relatives and divide
by their number, as follows:


TABLE 71.—COMPOSITE INDEX NUMBER OF THE PRICES OF COFFEE, WHEAT,
AND CANNED PEACHES
1926 = 100

Commodity           1926                   1933                  1941

Coffee              $p_0/p_0 = 100$        $p_1/p_0 =$ …         $p_2/p_0 =$ …
Canned peaches      $p_0'/p_0' = 100$      $p_1'/p_0' =$ …       $p_2'/p_0' = 77$
Wheat               $p_0''/p_0'' = 100$    $p_1''/p_0'' =$ …     $p_2''/p_0'' = 66$

Average             300/3 = 100            149/3 = 50            187/3 = 62


The resulting composite index number shows that on the 
average these three prices fell to 50 in 1933 in comparison with 
100 in 1926 and then rose on the average to 62 in 1941 in com- 
parison with 100 in 1926. Reducing this method to symbols, 

Let $p_0, p_1, p_2$ represent the prices of coffee. 

$p_0', p_1', p_2'$ represent the prices of canned peaches. 

$p_0'', p_1'', p_2''$ represent the prices of wheat. 

The relatives that appear in Table 71 are thus shown also in 
symbols. For example, the ratio $p_2''/p_0''$ corresponds to 66; in 
these symbolical presentations the multiple 100 is always “understood” 
and not actually written in the formula. The averages 
for the three would be expressed by symbols in Table 72: 


TABLE 72 


These averages are represented by the letter P, and when N 
commodity prices are averaged, instead of only three, for n years, 
instead of only 3, the series of averages of relatives is repre- 
sented symbolically as follows: 

\[
P_{00} = \frac{\sum \dfrac{p_0}{p_0}}{N}, \quad
P_{01} = \frac{\sum \dfrac{p_1}{p_0}}{N}, \quad \ldots, \quad
P_{0n} = \frac{\sum \dfrac{p_n}{p_0}}{N} \tag{1}
\]


The capital N refers to the number of prices; and the small 
subscript n refers to the number of years, or number of time periods, 
which might be months or weeks as well as years. In general, 
the subscripts to the series of p's represent the time periods, and 
0 is assigned to the base time period, at which the relative equals 
100. The average of the relatives likewise equals 100 in the 
base time period. The primes refer to different commodities. 
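As a minimal sketch of Eq. (1), the relatives given in this section for coffee, canned peaches, and wheat can be averaged period by period in a few lines of Python. The function name `simple_average_of_relatives` is an illustrative assumption and not part of the text; the median variant mentioned earlier is obtained by passing a different averaging function.

```python
from statistics import mean, median

# Price relatives (1926 = 100) as given in the tables of this section.
relatives = {
    "coffee":         [100, 43, 44],   # 1926, 1933, 1941
    "canned peaches": [100, 58, 77],
    "wheat":          [100, 48, 66],
}

def simple_average_of_relatives(relatives, average=mean):
    """Average the relatives of all commodities, period by period (Eq. 1)."""
    periods = zip(*relatives.values())   # one tuple of relatives per period
    return [average(period) for period in periods]

simple_index = simple_average_of_relatives(relatives)
median_index = simple_average_of_relatives(relatives, average=median)

print([round(x) for x in simple_index])  # arithmetic-mean version
print(median_index)                      # median version
```

With only three commodities the mean and the median differ noticeably, which is why the choice of average matters most for small samples.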

Weighted Average-of-relatives Method. The simple average of 
relatives involves the assumption that changes in the several 
prices to be combined are of equal importance; but this may not 
be true. Consequently, the idea of weighting the component 
price relatives in accordance with weights that are considered to 
reflect their relative importance has been developed. 

The weights are commonly based upon some rational con- 
sideration such as the quantities consumed in a given represen- 
tative year, the quantities produced, family budget figures, or 
some other criterion. Suppose, after considering all available 
information on the subject, changes in the price of a pound of 
coffee are considered thrice as important as changes in the price 
of a dozen cans of peaches and changes in the price of wheat per 
bushel are judged twice as important as changes in the price of a 
pound of coffee. Convenience of calculation will be attained 
if the numbers used as weights are so arranged that they will 
sum up to 1, 10, or 100, because the averaging process will then 
be a simple matter of changing decimal points in the sum of the 
weighted relatives. Such a manipulation of the quantities 
representing weights will have no effect on the final answer and 
will reduce the amount of work considerably if the problem is a 
long one involving, say, several years of monthly indexes. In 
the illustration used above, the weights are as follows: 


Coffee, 3 = $w$ 
Canned peaches, 1 = $w'$ 
Wheat, 6 = $w''$ 




A weighted average-of-relatives index number of these three 
commodities would be calculated as illustrated in Table 73. 
TABLE 73.—INDEX NUMBER OF THE PRICES OF COFFEE, WHEAT, AND 
CANNED PEACHES 
Weighted average of relatives, 1926 = 100 

Commodity                      1926               1933              1941 
Coffee                         100 × 3 = 300      43 × 3 = 129      44 × 3 = 132 
Canned peaches                 100 × 1 = 100      58 × 1 = 58       77 × 1 = 77 
Wheat                          100 × 6 = 600      48 × 6 = 288      66 × 6 = 396 
Sum of weighted relatives      1000               475               605 
Weighted average (sum ÷ 10)    100                47.5              60.5 


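In symbolic language, the weighted average of relatives illus- 
trated in Table 73 is as follows: 

\[
P_{00} = \frac{\sum w \dfrac{p_0}{p_0}}{\sum w}, \quad
P_{01} = \frac{\sum w \dfrac{p_1}{p_0}}{\sum w}, \quad \ldots, \quad
P_{0n} = \frac{\sum w \dfrac{p_n}{p_0}}{\sum w} \tag{2}
\]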
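The weighted average of relatives can be sketched in the same way. The relatives and the weights 3, 1, and 6 are those of the illustration above; the function name `weighted_average_of_relatives` is an assumption made for this example only.

```python
# Price relatives (1926 = 100) and judgment weights from the illustration.
relatives = {
    "coffee":         [100, 43, 44],   # 1926, 1933, 1941
    "canned peaches": [100, 58, 77],
    "wheat":          [100, 48, 66],
}
weights = {"coffee": 3, "canned peaches": 1, "wheat": 6}

def weighted_average_of_relatives(relatives, weights):
    """Sum of weight times relative, divided by the sum of the weights (Eq. 2)."""
    total_weight = sum(weights.values())
    n_periods = len(next(iter(relatives.values())))
    return [
        sum(weights[name] * series[t] for name, series in relatives.items())
        / total_weight
        for t in range(n_periods)
    ]

weighted_index = weighted_average_of_relatives(relatives, weights)
print(weighted_index)  # [100.0, 47.5, 60.5]
```

Because the weights sum to 10, the division is only a shift of the decimal point, which is the convenience the text describes.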


Instead of weighting by arbitrary weights, the actual quan- 
tities of the articles consumed or produced in the base year are 
sometimes used as weights, if such data are available. The 
quantities of the base year or base period are retained through- 
out, instead of getting the new quantities each year or each time 
period, for two reasons: (1) because it is difficult if not impossible 
to get quantity figures for every year and (2) because the pro- 
portions between these quantities are not likely to change 
greatly over short periods of time. If, after a given base period 
has been used for some time, it is discovered that one or several 
of the quantity weights are at variance with current conditions 
that seem to be likely to persist, the system of weights may be 
revised. In the various index numbers it constructs, the Bureau 
of Labor Statistics keeps continually on the watch for such 
changing conditions and when desirable changes the weighting 
system. 

The symbols for quantity weights are series of q’s, as follows: 

$q_0$ = quantity of coffee in 1926 
$q_1$ = quantity of coffee in 1933 
$q_2$ = quantity of coffee in 1941 

$q_0'$ = quantity of canned peaches in 1926 
$q_1'$ = quantity of canned peaches in 1933 
$q_2'$ = quantity of canned peaches in 1941 

Wheat would be the same arrangement of a series of $q''$’s. The 
resulting index number, using base-year quantities as weights for 
the relatives averaged, would be as follows: 


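\[
P_{00} = \frac{\sum q_0 \dfrac{p_0}{p_0}}{\sum q_0}, \quad
P_{01} = \frac{\sum q_0 \dfrac{p_1}{p_0}}{\sum q_0}, \quad \ldots, \quad
P_{0n} = \frac{\sum q_0 \dfrac{p_n}{p_0}}{\sum q_0} \tag{3}
\]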


Simple Aggregative Method. As suggested by its name, a 
simple aggregative index is the sum of the absolute prices, 
without first changing them to relatives. Thus the raw 
prices of coffee, canned peaches, and wheat for 1926, then for 
1933, and then for 1941 would be added together to give 
the index. This seems to be combining nonhomogeneous 
things, and it is; nevertheless, there is one famous and at one 
time widely used index that was based upon this method. 
Such was Bradstreet’s index of wholesale prices, which continued 
in use for many years. Following is an illustration of the method 
of Bradstreet’s index of wholesale prices:¹ 


Prices, Dollars per Pound 


0.0007 Connellsville coke, southern coke 
0.001 Bituminous coal, brick, iron ore 
0.002 Anthracite coal 

0.003 Salt 

0.004 Bessemer pig iron 

0.34 Alcohol 

0.50 Australian wool 

0.52 Quicksilver 

0.84 Rubber 


9.8530 The sum, which is the index 


According to this method, the index number of prices does 
not assume the form of a relative, but appears as follows: 


1 A good description of Bradstreet’s index is contained in W. C. Mitchell, 
“Index Numbers of Wholesale Prices in the United States and Foreign 
Countries,” Bureau of Labor Statistics, Bulletin 284, pp. 161-165. Other 
price indexes are also discussed in that source, such as Dun’s, Gibson’s, 
the Annalist, War Industries Board, Federal Reserve Board, and the 
Bureau of Labor Statistics. 
TABLE 74.—BRADSTREET’S INDEX 

1933         Index        1934         Index 
October      $9.0512      January      $8.8329 
November      8.8480      February      9.0110 
December      8.8126      March         9.2627 

The index can readily be converted into a series of relatives 
upon any chosen base; the Survey of Current Business published 
Bradstreet’s index converted into relatives, with the monthly 
average of 1926 = 100 until November, 1937, when compilation 
of the index was discontinued.¹ 
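The conversion of a simple aggregative index into relatives can be sketched as follows, using the six monthly figures of Table 74. The choice of October, 1933, as the base is an assumption made only for illustration (the published series used the 1926 monthly average, which is not given here), and the function name `to_relatives` is hypothetical.

```python
# Bradstreet's index (sum of prices per pound, in dollars) from Table 74.
bradstreet = {
    "1933-10": 9.0512, "1933-11": 8.8480, "1933-12": 8.8126,
    "1934-01": 8.8329, "1934-02": 9.0110, "1934-03": 9.2627,
}

def to_relatives(series, base_key):
    """Express every value as a percentage of the value in the base period."""
    base = series[base_key]
    return {key: value / base * 100 for key, value in series.items()}

# Assumed base for this sketch: October, 1933 = 100.
rebased = to_relatives(bradstreet, "1933-10")
for month, relative in rebased.items():
    print(month, round(relative, 2))
```

Any other month, or an average of several months, could serve as the base; only the divisor changes.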

Little rational justification can be mustered to the defense of 
such an index as Bradstreet’s, except that it worked well. Using 
approximately 96 commodities, it gave an index number that 
reflected accurately the changes in wholesale prices, as tested 
by more elaborately conceived and compiled indexes of 
wholesale prices later introduced into the field. Bradstreet’s 
index was the pioneer in the history of price indexes in the 
United States, having been started in 1897. The conversion of 
all prices into prices per pound gives the effect of a concealed 
weighting, but no logical basis can be found for such a system of 
weighting. The symbolic expression of this index is as follows: 


\[
\sum p_0, \quad \sum p_1, \quad \sum p_2, \quad \ldots, \quad \sum p_n \tag{4}
\]


When reduced to relatives and some base is taken as 100, it is 
as follows: 


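\[
P_{00} = \frac{\sum p_0}{\sum p_0}, \quad
P_{01} = \frac{\sum p_1}{\sum p_0}, \quad \ldots, \quad
P_{0n} = \frac{\sum p_n}{\sum p_0} \tag{5}
\]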


While the concealed weighting system of Bradstreet’s index is 
accidental, or haphazard, depending upon the units in which 
goods are quoted, it has the effect of making the high-priced 
articles dominant. Its success as a good index of price change 
was due to the fact that there was a skillful or at least a propitious 
use of stratified sampling in the selection of the prices 
used. 

1 Cf. Survey of Current Business, Supplement, Vol. 18 (1938), p. 168. 
Monthly figures for the index are available from 1903 and annual figures 
from 1890. See 1932 Supplement, pp. 28-29 and 1936 Supplement, p. 15. 
Also see Bureau of Labor Statistics, Bulletin 173 (July, 1915). 

Weighted Aggregative Method. In making index numbers by 
the aggregative method, it is usually considered that weights 
are required, for the same reason that they are regarded as 
necessary in constructing an index number by the average-of- 
relatives method. The most reasonable kind of weight would 
seem to be the quantities of the several commodities produced or 
consumed or marketed. Such figures have become increasingly 
available since the time when such indexes as Bradstreet’s were 
originally conceived and developed. 

The last four or five decennial censuses of the United States 
have included more and more complete data on physical quan- 
tities of production and, more recently, data on retail and 
wholesale trade; and, in the years since the First World War, 
yearly figures have been available on physical quantities of goods 
in stock and physical production of some goods, through the 
activities of the United States Departments of Commerce and 
Agriculture. If it is assumed that the method of weighting is 
one that uses actual quantity figures, there are two methods of 
weighting the price aggregates in order to construct the index 
number. The first method is called “weighting by base-year 
quantities.” The second method is called “weighting by given- 
year quantities.” 

The desirability of weighting by base-year quantities has a 
twofold explanation: (1) In spite of the increased availability 
of quantity figures, there are still many commodities for which 
quantity figures are not easily available for every year; but a 
large number of such quantity figures, classified so as to be 
useful for weighting purposes, are available for the census years. 
(2) With few exceptions, the proportional changes in the quan- 
tities or value weights from year to year are not sufficiently great 
to cause large errors if these proportions are assumed to remain 
constant for several years in succession. Adjustments in the 
quantity or value weights can be made in the case of rapidly 
growing or rapidly declining industries, but the necessity for such 
changes within 10-year periods will not include a very large 
number of commodities. As a purely practical matter, the 
choice of base-year weighting instead of given-year weighting 
gives adequate results with much less statistical calculation, as 
well as much less statistical research in seeking data for use 
as weights. 

United States Bureau of Labor Statistics. Construction of 
Index Illustrated by Practice of an Official Bureau. In the 
United States, the Bureau of Labor Statistics is one of the 
most important official compilers of index numbers of various 
kinds. From its publications can be illustrated how the various 
matters discussed above are brought into practice and how 
diligent must be the researcher, how alert the statistician, to 
new problems of weighting, sampling, and the like. 

In 1943 the Bureau of Labor Statistics of the United States 
Department of Labor was compiling and publishing weekly, 
monthly, and annual index numbers of wholesale commodity 
prices. In a revision made in 1927, when the base period was 
changed from the 1913 average to the 1926 average, a new weight- 
ing system was adopted; it was then decided to revise the quan- 
tities used as weighting factors every 2 years, as the results of 
each new biennial census of manufactures became available. 
At the same time, the number of price series was increased from 
404 to 550. Another revision was made in 1931, when the num- 
ber of price series was changed from 550 to 784 and some rear- 
rangement of the items in the groups and subgroups was made. 
No change was made in 1931 in the method of calculating the 
indexes. In December, 1942, according to the Survey of Current 
Business, the monthly index of wholesale prices compiled by the 
Bureau of Labor Statistics was made up of 889 quotations.¹ 

The weights used for farm products are based on averages 
for 3-year periods, changed every 2 years in order to keep the 
weights up to date. Thus, for the years 1932 and 1933, the 
weights used for farm prices were based upon averages of quan- 
tities marketed in the years 1927, 1928, and 1929; and for the 
years 1934 and 1935 the weights used for farm-products prices 
were based upon averages of quantities marketed in the years 
1929, 1930, and 1931. For all other groups of commodity prices, 
the weights used are averages of quantities produced for sale, to 
which has been added the average of imports for consumption, in 
the last two completed census periods. For example, for the 
years 1932 and 1933, the weights were based on average census 
data (plus imports for consumption) for the years 1927 and 1929, 
whereas for the years 1934 and 1935 the weights were based on 
average census data (plus imports for consumption) for the years 
1929 and 1931. In cases where census data are lacking, esti- 
mates are made of the quantities of the various commodities 
marketed, based on the best information available from govern- 
mental and reliable private sources; and these estimates are used 
as weighting factors. Commodities are added or dropped from 
time to time as they become important or cease to be important 
in the markets.* 

1 Survey of Current Business, Vol. 22 (December, 1942), p. S-3. On the 
history of its compilation, weighting, etc., see Bureau of Labor Statistics, 
Bulletin 181, 415, 453, 521; “Wholesale Prices,” Serial No. R1434 (Decem- 
ber, 1941); “Revised Method of Calculation of the Wholesale Price Index 
of the United States Bureau of Labor Statistics,” Serial No. R.666. 

During the period of depression following 1932, when the 
data on manufactured output became violently disrupted, the 
weights based upon averages of the years 1929-1931 were 
retained. Most of the prices continued in 1943 to be weighted 
by the averages of the 1929-1931 census data, but for certain 
commodity groups new weights had begun to be based upon 
special studies of those groups. Thus, in April, 1941, the Bureau 
published a study of the “Wholesale Price Trends of Carpets and 
Rugs” revising its price series for this group. This study 
included new weights for the prices in this group, according to 
their “importance in the country’s markets in 1939.” 

The quantity weight used for each of the series, the unit in 
which each is priced, and the 1939 value of each item expressed 
as a percentage of the aggregate value of all carpet and rug items 
in the Bureau’s indexes are shown in Table 75. 

TABLE 75 

Price series              Unit in which priced      Weight 
Axminster carpets         Lineal yard               7,077 
Axminster 9 × 12 rugs     Each                      2,015 
Plain velvet carpets      Lineal yard               6,424 
Plain velvet carpets      Square yard               7,861 
Wilton 9 × 12 rugs        Each                        612 

Source: Bureau of Labor Statistics, mimeographed publication, “Wholesale Price Trends 
of Carpets and Rugs,” April, 1941, pp. 16-17. 

* If a “price” index, as contrasted with a “realized price” index, is 
desired, it is necessary to keep the weights constant. In constructing a 
realized price index the weights may be changed, but in revising the weights 
the index must be calculated by using both sets of weights for the over- 
lapping year or period when the change is made. By “realized price” is 
meant the dollars covering the transaction, divided by the units involved in 
the transaction. When the lack of continuity of specifications makes it 
hard to define the commodity, as with automobiles, the dollars of sales for 
each general type (sedans, coupes, etc.) divided by the number of such 
units, in other words, the “realized price,” has received the endorsement 
of competent price experts as an acceptable quotation to use in price 
statistics. 

The use of data for 1939 departed from the “general practice 
of using the 1929 and 1931 data for weighting in the Bureau’s 
wholesale price indexes,” in order to provide weights for the 
individual items that reflected their relative importance more 
nearly in accordance with present-day sales. The Axminster 
type has long been the most popular. The relative importance 
of Wilton carpets and rugs has increased considerably since the 
depression of the early 1930's, and they have regained much of 
their earlier popularity. Prior to the depression and before 
plain velvets became popular, the importance of Wiltons, on a 
dollar basis, was almost as great as that of Axminster carpets 
and rugs. During the depth of the depression, when consumer 
incomes were greatly reduced, there was a lessened demand for 
Wiltons, apparently because they were much more expensive, 
on the average, than Axminsters. 

The study of the carpet and rug price series is presented to 
illustrate the alertness of the Bureau in relation to the problem of 
compiling and publishing its indexes of wholesale prices. Its 
activity extends to other groups of price series as well. For 
example, beginning with January, 1938, the results of a survey 
of farm-machinery wholesale prices were incorporated for the 
first time in the Bureau of Labor Statistics general indexes of 
wholesale prices. In 1941 the Bureau began publishing weekly 
index numbers of waste and scrap materials, carrying the index 
back to January, 1939. In “Wholesale Prices” (June, 1941), 
the Bureau published a monthly index of standard machine-tool 
prices, including 11 types of standard nonspecialty machine 
tools, carrying the index back to January, 1937. These new 
indexes are calculated on August, 1939, as a base; the monthly 
index of wholesale prices continued in 1943 to be based on the 
average of 1926 as 100. 
Method of Computation Illustrated. The weighted aggregative 
method of computing an index number of prices is illustrated in 
Table 76, using only five price series. The five price series 
selected for illustration are those for carpets and rugs, for which 
the Bureau’s weights are shown in Table 75. A procedure 
similar to that illustrated in Table 76 is used by the Bureau, 
but with 889 price quotations instead of 5. 


TABLE 76.—WORK SHEET ILLUSTRATING CALCULATION OF WEIGHTED 
AGGREGATIVE INDEX 
Base-period weights 

                                     Average price         Weights    Weighted price 
Commodity                            1935-1939   1941      q0         1935-1939   1941 
                                     p0          p1                   p0q0        p1q0 
(1)                                  (2)         (3)       (4)        (5)         (6) 
Axminster carpets                     1.567       2.014    7,077       11,090      14,253 
Axminster 9 × 12 rugs                22.745      27.936    2,015       45,831      56,291 
Plain velvet carpets (lineal yard)    1.772       2.356    6,424       11,383      15,135 
Plain velvet carpets (square yard)    2.581       3.266    7,861       20,289      25,674 
Wilton 9 × 12 rugs                   40.007      50.521      612       24,484      30,919 
Σ                                                                     113,078     142,272 
× (100/Σp0q0)                                                        = 100.00    = 125.82 
P01 = Σp1q0/Σp0q0 


Source: CUTTS, JESSE M., and SAMUEL J. DENNIS, “Revised Method of Calculation of 
Wholesale Price Indexes,” Journal of the American Statistical Association, Vol. 32 (1937); 
also reprinted by the Bureau of Labor Statistics as Serial No. 666. 


Accordingly, the index number of the wholesale prices of 
carpets and rugs is 100.00 for the years 1935-1939, the base 
period, and 125.82 for 1941. The latter figure is obtained by 
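taking $P_{01} = \sum p_1 q_0 / \sum p_0 q_0$. From the system of symbols already 
introduced, the symbolic presentation of this form of index 
number is as follows: 

\[
P_{00} = \frac{\sum p_0 q_0}{\sum p_0 q_0}, \quad
P_{01} = \frac{\sum p_1 q_0}{\sum p_0 q_0}, \quad
P_{02} = \frac{\sum p_2 q_0}{\sum p_0 q_0}, \quad \ldots, \quad
P_{0n} = \frac{\sum p_n q_0}{\sum p_0 q_0} \tag{6}
\]

This is illustrated in the figures of Tables 75 and 76, the 1941 
index being (142,272/113,078) × 100 = 125.82. 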
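The work sheet of Table 76 can be reproduced as a short Python sketch. The prices and weights are those of Tables 75 and 76; the variable names are assumptions made for this example, and the products are summed before rounding, so individual rows may differ from the printed figures by a unit in the last place.

```python
# (series, p0 = 1935-1939 average price, p1 = 1941 price, q0 = base-period weight)
rows = [
    ("Axminster carpets",                  1.567,  2.014, 7077),
    ("Axminster 9 x 12 rugs",             22.745, 27.936, 2015),
    ("Plain velvet carpets, lineal yard",  1.772,  2.356, 6424),
    ("Plain velvet carpets, square yard",  2.581,  3.266, 7861),
    ("Wilton 9 x 12 rugs",                40.007, 50.521,  612),
]

# Aggregates of prices weighted by base-period quantities.
base_aggregate = sum(p0 * q0 for _, p0, _, q0 in rows)    # sum of p0*q0
given_aggregate = sum(p1 * q0 for _, _, p1, q0 in rows)   # sum of p1*q0

# Weighted aggregative index for 1941 on the 1935-1939 base (Eq. 6).
index_1941 = 100 * given_aggregate / base_aggregate

print(round(base_aggregate), round(given_aggregate), round(index_1941, 2))
```

The divisor `base_aggregate` is computed once and reused for every later period, which is the practical advantage of base-period weighting noted in the text.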
Price Indexes and Quantity Indexes. Aggregative Index Using 
Given-year Weights. If given-year quantity weights are avail- 
able and used for computing an index, it must be noted that the 
following system of ratios would merely give an index of chang- 
ing aggregate values, without distinguishing which part of the 
change is due to price change and which part to quantity change: 


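\[
R_{00} = \frac{\sum p_0 q_0}{\sum p_0 q_0}, \quad
R_{01} = \frac{\sum p_1 q_1}{\sum p_0 q_0}, \quad
R_{02} = \frac{\sum p_2 q_2}{\sum p_0 q_0}, \quad \ldots, \quad
R_{0n} = \frac{\sum p_n q_n}{\sum p_0 q_0}
\]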


Such an index is an index of aggregate value, made up partly of 
changes in quantity and partly of changes in price. In order, 
therefore, to extract from it that part of the change which is due 
solely to price change, the base-year prices must be multiplied 
throughout by the given-year weights. This fact makes the 
given-year weighting method a very long one to calculate; it loses 
the advantage, inherent in the aggregative index weighted by 
base-year quantities, of having a constant divider in securing the 
index. In addition, the method of weighting by given-year 
quantities necessitates the two sets of cross products for each 
year—each year’s prices multiplied by that year’s quantities 
and by the base-year quantities. Following is the symbolic 
expression of the aggregative index of prices weighted by given- 
year quantities: 


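\[
P_{00} = \frac{\sum p_0 q_0}{\sum p_0 q_0}, \quad
P_{01} = \frac{\sum p_1 q_1}{\sum p_0 q_1}, \quad
P_{02} = \frac{\sum p_2 q_2}{\sum p_0 q_2}, \quad \ldots, \quad
P_{0n} = \frac{\sum p_n q_n}{\sum p_0 q_n} \tag{7}
\]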


Index of Quantities Weighted by Prices. An advantage of the 
given-year weighting method is that an index of quantities 
weighted by prices can be obtained as a by-product, with com- 
paratively little additional calculation. The same numerators as 
those used in Eq. (7) can be used to calculate an index of quan- 
tity change weighted by given-year prices. For each year, 
given-year prices are multiplied by base-year quantities, and 
these aggregates are used as dividers. This will give an index of 
quantity weighted by given-year prices, as follows: 


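\[
Q_{00} = \frac{\sum p_0 q_0}{\sum p_0 q_0}, \quad
Q_{01} = \frac{\sum p_1 q_1}{\sum p_1 q_0}, \quad
Q_{02} = \frac{\sum p_2 q_2}{\sum p_2 q_0}, \quad \ldots, \quad
Q_{0n} = \frac{\sum p_n q_n}{\sum p_n q_0} \tag{8}
\]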
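How these aggregates fit together can be seen in a small sketch. The prices and quantities below are entirely hypothetical (two commodities, one base year and one given year) and the helper `aggregate` is an assumed name; the formulas follow the verbal descriptions above: the value index divides given-year value by base-year value, the given-year-weighted price index divides it by base-year prices times given-year quantities, and the quantity index divides it by given-year prices times base-year quantities.

```python
# Hypothetical data for two commodities (not taken from the text).
p0, q0 = [2.0, 5.0], [10, 4]   # base-year prices and quantities
p1, q1 = [3.0, 4.0], [9, 6]    # given-year prices and quantities

def aggregate(prices, quantities):
    """Sum of price times quantity over all commodities."""
    return sum(p * q for p, q in zip(prices, quantities))

value_index = aggregate(p1, q1) / aggregate(p0, q0)           # aggregate value
price_base_weighted = aggregate(p1, q0) / aggregate(p0, q0)   # base-year weights
price_given_weighted = aggregate(p1, q1) / aggregate(p0, q1)  # given-year weights
quantity_given_prices = aggregate(p1, q1) / aggregate(p1, q0) # given-year prices

# The value change splits exactly into a price part and a quantity part.
product = price_base_weighted * quantity_given_prices
print(round(value_index, 4), round(product, 4))
```

The two printed figures agree, which illustrates why the quantity index comes as a by-product: its numerator is the same given-year value aggregate already needed for the price index.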


Unfortunately, this advantage in the given-year weighting 
method is largely imaginary because the quantity data are not 
available soon enough for short periods of time to make it prac- 
ticable to construct monthly or weekly indexes. In any case, it is 
also possible to obtain a quantity index weighted by given-year 
prices, using the following equation, which would provide a 
much simpler method: 


yx > zs 
m) SE ATE x 9) 
Qoo SN Qor Spd Qoz 205 ( 


Quantity indexes are constructed, however, by other methods, 
usually with more general application of stratified sampling and 
with other weights than prices, largely because of the difficulty 
of obtaining quantity data. Not only are these other methods 
more convenient to calculate, but they make it possible to 
handle matters having to do with weighting and bias in the 
results. In using equations like Eqs. (8) and (9), it is often very 
difficult to appraise the inaccuracies due to bias inherent in the 
method. 

Quantity Indexes and Business Barometers. Indexes of 
Quantity of Trade or Production. Several statisticians and 
economists made attempts, especially in the years immediately 
following the First World War, to construct an index that would 
trace variations in the physical volume of production or trade. 
Pioneer efforts to construct such indexes, based upon scant 
material and with little in the way of a statistical theory to 
guide them, were made before the First World War by Wesley C. 
Mitchell, Irving Fisher, and Edwin W. Kemmerer. During the 
war and postwar period important progress was made, especially 
by Edmund E. Day, Warren M. Persons, and others. In 1923, 
the latter published an index of trade for the United States, 
beginning with the year 1903.* The index of production is 
based very heavily on the index of employment; it might there- 
fore fail to reflect properly the results of technological advance. 


* “An Index of Trade for the United States,” Review of Economic Statistics, 
Preliminary Vol. 5 (April, 1923), pp. 71-78. Cf. also GARFIELD, FRANK R., 
“General Indexes of Business Activity,” Federal Reserve Bulletin, Vol. 26 
(1940), pp. 495-501. 
After these experiments by pioneering individuals, several government 
agencies and privately financed research organizations 
took up the task of developing indexes of trade and production. 
The most widely known and now most currently used index of 
industrial production for the United States is that compiled and 
regularly published in the Federal Reserve Bulletin by the Research 
Division of the Board of Governors of the Federal Reserve Sys- 
tem. This index is compiled from 95 individual series of monthly 
data, representing about 85 per cent of the total industrial pro- 
duction of the United States. The series include 22 durable-goods 
manufacturing industry series, 63 nondurable-goods 
manufacturing industry series, and series representing production 
of fuels and metals. This index is also regularly reproduced in 
the Survey of Current Business, published by the United States 
Department of Commerce.¹ A reproduction of the entire index, 
with its component parts, 1923-1940 by months, with the aver- 
age 1935-1939 = 100 as the base, can be found in the Federal 
Reserve Bulletin.² 

It is characteristic of the indexes of physical volume of produc- 
tion or trade that they consist of combinations of various series 
upon the basis of stratified sampling, the weights for the repre- 
sentative series being devised upon a priori knowledge concern- 
ing the importance of certain groups of activity in relation to the 
whole of business activity. These indexes treat the separate 
series statistically before putting them together. For example, 
they remove seasonal variation and trend from the separate 
series and thus average together the cycles of the various separate 
series into the composite. The method of averaging employed 
is generally the aggregative method, although since 1940 the 
Federal Reserve Board uses an average of relatives weighted by 
quantities so that the final result is equivalent to what would be 
obtained by using the weighted aggregative method.³ 

1 Federal Reserve Bulletin, Vol. 13 (February, March, 1927), Vol. 17 
(February, September, 1931), Vol. 18 (March, 1932); for adjustments made 
necessary by the 1942 world war, see Vol. 27 (1941), pp. 878-881; cf. Survey 
of Current Business, Vol. 20 (1940), pp. 11-17. 

2 Vol. 26 (1940), pp. 825-882; see also “Answer to Critics of the Index,” 
pp. 1047-1049. See also WOODLIEF, THOMAS, and MAXWELL R. CONKLIN, 
“Measurement of Production,” Federal Reserve Bulletin, Vol. 26 (1940), 
pp. 912-924. 

3 See WOODLIEF and CONKLIN, op. cit. 

Computation of Weights Illustrated. The index of manufac- 
tures published by the Federal Reserve Board is weighted by the 
total value added by manufacture in the case of all manufactur- 
ing industries, and the index of mineral production is weighted 
by value of mineral products. The sum of these two is the 
index of production. The individual production series of which 
the manufacturing index is composed are weighted, as nearly as 
possible, according to the same principle. Accordingly, the 
total value added by manufacturing industries in 1937, as 
reported by the United States census, was distributed among the 
16 groups represented in proportion to the value added for each 


TABLE 77.—RELATIVE IMPORTANCE OF INDUSTRY GROUPS AND SELECTED 
INDUSTRIES INCLUDED IN THE FEDERAL RESERVE BOARD INDEX 
OF INDUSTRIAL PRODUCTION 
(Per cent of total with 1937 weights) 

Series                                          1937 weights 
Industrial production                           100.00 
  Manufactures 
    Durable manufactures 
      Iron and steel 
      Machinery production 
      Transportation equipment 
      Nonferrous metals and their products 
      Lumber and its products 
      Stone, clay, and glass products 
    Nondurable manufactures                      46.87 
      Textiles and their products                11.22 
      Leather and its products 
      Manufactured food products 
      Alcoholic beverages 
      Tobacco products 
      Paper and its products 
      Printing and publishing 
      Products of chemicals 
      Rubber products 
  Minerals 

1 For industry series in which census data on value added by manufacture 
are not available other criteria had to be used, such as total value of manu- 
factured product, raw materials consumed, or man-hours worked. See ibid., 
pp. 917-918. 

group, and then derived group totals were subdivided among 
industries and finally among individual products in a similar 
manner. Each individual series is thus assigned a hypothetical 
value-added figure, which is then divided by the relative of the 
1935-1939 compared with 1937 in order to convert 1937 value- 
added figures to 1935-1939 base. The derived 1935-1939 figures 
for each series are then expressed as percentages of their own 
total to obtain the weights. These percentages represent the 
estimated relative importance of each series in the 1935-1939 
base period and are the weights applied to the relatives in 
combining them into the index of production. Table 77 repro- 
duces a summary of these weights. 
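The derivation of weights described above can be sketched in Python. All figures below are hypothetical, and the sketch assumes that the "relative" for each series is its 1937 output expressed as a percentage of its 1935-1939 average; the function name `base_period_weights` is likewise an assumption.

```python
# Hypothetical series: 1937 value added (millions of dollars) and the
# 1937 output relative on the 1935-1939 base (assumed interpretation).
series = {
    "series A": {"value_added_1937": 500.0, "relative_1937": 125.0},
    "series B": {"value_added_1937": 300.0, "relative_1937": 100.0},
    "series C": {"value_added_1937": 220.0, "relative_1937": 110.0},
}

def base_period_weights(series):
    """Convert 1937 value added to the 1935-1939 base, then to percentages."""
    base_values = {
        name: data["value_added_1937"] / (data["relative_1937"] / 100)
        for name, data in series.items()
    }
    total = sum(base_values.values())
    return {name: 100 * value / total for name, value in base_values.items()}

weights = base_period_weights(series)
for name, weight in weights.items():
    print(name, round(weight, 2))
```

The resulting percentages sum to 100 and play the role of the weights applied to the production relatives.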

Using the weights shown in Table 77, the equation for the 
Federal Reserve Board’s index of industrial production is as 
follows: 

\[
P_{0n} = \sum \left( w \, \frac{q_n}{q_0} \right) \tag{10}
\]

in which 

\[
w = \frac{p_w q_0}{\sum p_w q_0}
\]

and $p_w$ represents the value (or value added) per unit of output in 
the weight-base period. 

Barometers, or Indexes, of General Business Conditions. Some 
composite indexes purport to be barometers or indexes of business 
and trade in general. These indexes are of two types: (1) A 
single series is sometimes believed to be a barometer of general 
business conditions and (2) a number of indexes of trade activity 
are combined in order to measure general business conditions. 

Of the first type, the most prominent one at present is prob- 
ably the index of electrical-power production, which is compiled 
from quantity figures published by the Geological Survey. The 
index of activity in the steel industry at one time was looked 
upon as a good barometer of general business conditions because 
so many industries are dependent upon steel or steel products. 
The trends in the average of security market prices are sometimes 
taken as a barometer of coming business conditions, or at least 
as a measure of existing conditions. In wartime, the security 
markets often reflect conditions and war information that are 
not generally publicized. 
A good example of the second type of index of general business 
conditions is that published currently by the New York Times and 
formerly by the Annalist. This index has been widely used 
and until October, 1937, was reproduced in the Survey of Current 
Business, published by the United States Department of 
Commerce. 

A more comprehensive example of this second type of index 
of business conditions is one that has evolved from the pioneer 
work of Carl Snyder, whose procedure was based upon the 
theory that the fluctuations in total bank clearings are made up 
of two variables: (1) price change and (2) change in physical 
volume of trade. By constructing a price deflator, which has 
already been discussed, and then by using this deflator to cancel 
out from aggregate bank clearings that part due to price changes, 
he sought to obtain an index of physical volume of trade for 
the years 1875-1924. Modifications and refinements were made 
in the construction of this index by Leroy M. Piser, so that it 
was known as the Snyder-Piser index of volume of trade for the 
United States. It included 89 series, classified as follows: pro- 
ductive activity, 46 series; primary distribution, 13 series; 
distribution to consumer, 8 series; financial activity, 6 series; 
general (such as life insurance, postal receipts, electrical-power 
corporations, farmers, and communication), 5 series; and finally 
debits outside New York City. Thus it came to be based upon 
the principle of stratified sampling. This index of volume of 
trade and production is published monthly by the Federal 
Reserve Bank of New York in its Monthly Review of Credit and 
Business Conditions.² 
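
The deflating step in Snyder's procedure, in which the part of bank clearings due to price change is cancelled out, can be sketched as a simple division. This is only a sketch under assumed figures: the function name `deflate` and the two-period data are hypothetical and are not Snyder's series.

```python
def deflate(values, price_index, base=100.0):
    """Divide each dollar value by the price index of the same period.

    The result is expressed in dollars of the period in which the price
    index equals `base`, so that remaining changes reflect physical volume.
    """
    if len(values) != len(price_index):
        raise ValueError("values and price_index must have the same length")
    return [v / p * base for v, p in zip(values, price_index)]


# HYPOTHETICAL figures: clearings rise 32 per cent while prices rise 10 per cent.
clearings = [100.0, 132.0]
prices = [100.0, 110.0]
volume = deflate(clearings, prices)
print([round(v, 1) for v in volume])
```

In this assumed case the deflated series rises by 20 per cent, which is the part of the rise in clearings not accounted for by prices.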

Various forecasting services compile their respective indexes of 
business conditions according to their particular interpretation 
as to what should best be included in such an index and how 
best to weight various factors. Carefully worked out indexes 
of the marketing of farm products and forestry products are now 
available as a result of the efforts of the Bureau of Agricultural 
Economics in the United States Department of Agriculture. 
These are reproduced in the Survey of Current Business. 


1 SNYDER, CARL, “A New Clearings Index of Business for Fifty Years,” 
Journal of the American Statistical Association, Vol. 19 (1924), pp. 329-335. 

2 JOHNSON, NORRIS O., “New Indexes of Production and Trade,” Journal 
of the American Statistical Association, Vol. 33 (1938), pp. 341-348. 

The Bureau of Foreign and Domestic Commerce also publishes 
indexes of domestic commodity stocks, and of world stocks of 
certain outstanding industries, which constitute good barometers 
of the related business conditions.¹ 


ADJUSTMENT OF INDEXES TO BENCH MARKS 


Ideal Conditions for Stratified Sampling Nonexistent. In order 
to produce results approximating those of true random sampling, 
conditions favorable to random sampling in the subgroups or 
strata must exist. For one thing, this means that there must 
be large numbers of items from which to draw within each sub- 
group. It also means that the number of sample items must be 
sufficient to avoid the disadvantages of small samples. The law 
of large numbers must be given opportunity to produce the 
results of true random selection within each stratum; such is the 
sine qua non of truly successful stratified sampling. Under such 
conditions the method of selection causes no accumulation of 
bias. Such ideal conditions do not exist with reference to any 
known index number, not even the Bureau of Labor Statistics 
index of wholesale prices, which contains a total of more items 
than any other index. 

Nevertheless, the pattern of stratified sampling can with 
considerable advantage be adopted as the guide to procedure in 
the construction of indexes of all types. Following this pattern 
the investigator first works out a system of classification of the 
data for which an index is to be compiled. Using the subgroups 
of this classification, he can then proceed according to the prin- 
ciples of stratified random sampling so far as it is possible to do 
so. When he finds that conditions ideal for random sampling in a 
subgroup fail to exist, the investigator must resort to subjective 
means to secure results that he believes will be representative. 

Inasmuch as all indexes contain, in some part, data that have 
been collected and processed by the use of such subjective 
methods, employed in the absence of ideal sampling conditions, 
it is desirable wherever possible for the statistician to find bench 
marks with which he can compare the results of his sampling 
procedure. 


1 Survey of Current Business, Vol. 20, Annual Supplement (1940), pp. 
88-164. For a more complete discussion of barometers of general business 
conditions see Wesley C. Mitchell, Business Cycles: The Problem and Its 
Setting (1928), pp. 291-330; Joseph L. Snider, Business Statistics; and 
Garfield, op. cit. 

A common-sense appraisal of the over-all result is 
the most generally used bench mark to judge whether or not the 
results are satisfactory; but this method presupposes an unusual 
amount of a priori knowledge and of scientific critical judgment 
on the part of the statistician. Sometimes more objective bench 
marks may be found to aid the statistical worker along his 
thorny path. These will be illustrated in the ensuing sections. 

Reasons Why Indexes Require Adjustment. The reasons why 
indexes require adjustment to bench marks do not necessarily 
arise from faulty application of the method of stratified sampling. 
They arise from the nature of the universe from which the sample 
is taken. In connection with most types of data collected for 
the construction of indexes, the universe is a discrete rather than 
a continuous one; in other words, the universe consists of com- 
paratively small numbers of units, each of which constitutes a 
comparatively large proportion of the whole universe. Often 
they cannot be considered as representative of each other. 

When such a comparatively small universe is subdivided, in 
order to apply the stratified sampling technique, the strata con- 
stitute universes with still smaller numbers. Added to this is 
the usual fact that only a portion of this remaining small num- 
ber is accessible to the data collector; in some cases, unavoidable 
bias itself constitutes a part of the reason for the accessible por- 
tion. Under such circumstances it is almost impossible to 
realize the essential condition of randomness of selection in the 
respective strata, and consequently stratified sampling technique 
gives less satisfactory results. 

Such is the situation with respect to sample data collected 
from business firms, especially manufacturing enterprises. In 
some of the subdivisions, corporate enterprise is on so large a 
scale that only a few firms represent a large portion of that 
stratum. In all subdivisions, the size of the sample return 
measures, not only changes in the trends it is desired to measure, 
but also success or failure of the collecting agency in persuading 
firms to report. The statistical technique of comparing iden- 
tical firms from month to month reduces but does not altogether 
obviate the cumulative error resulting from this weakness. 

In addition, growth of an industry, and hence growth in pay 
rolls, in output, in stocks of materials, or in whatever is the sub- 
ject of investigation, occurs not only in existing firms; but part 
of the growth is in the rise of new firms in the industry. In some 
strata, perhaps the steel and machinery industries, expansion or 
contraction of existing firms (and hence those reporting in a 
stratified sample) may accurately reflect proportionately a rise 
or fall in business in that stratum. But in other strata, like some 
branches of the textile industry or the food industry, the expan- 
sion or contraction of existing firms (and hence those reporting 
in a stratified sample) may not at all reflect proportionately the 
rise or fall in business. 

In heavy industry, where plant and equipment constitute a 
large proportion of the business investment, cyclical changes of 
the sample might well be much greater than cyclical changes in 
the universe. This would follow if the large firm, with heavy 
investment in plant and equipment, tended to curtail production 
instead of lowering prices when faced with declining business 
prospects. 

In some branches of the clothing and food industries, in which 
small investment in plant and equipment and large numbers 
of small firms predominate, cyclical changes may be smaller in 
the sample than in the universe. The birth of new firms or 
resumption of activity by old firms is the principal manner of 
expansion in such strata. The death of old firms is the principal 
means of decline. The reporting firms are quite likely to be the 
ones that would not die at a rate so rapid as the average rate in 
the industry. 

Circumstances like those just described constitute only an 
illustration of the type of problem facing the statistician, who 
must continually endeavor to improve the sample of reporting 
firms. Great as his efforts and ingenious as his imagination 
may be, the resulting sample is likely to show bias. 

For bench marks in connection with adjustment of indexes 
of employment and pay rolls, statisticians have made use of the 
successive issues of the census of manufactures, often called the 
Biennial Census of Manufactures, which appeared in 1914, 1919, 
and each odd year thereafter, including 1939. For years after 
1939 it should be possible to get similar bench-mark data from 
the records of the Social Security Administration. 

By using the census-of-manufactures data as bench marks, it 
has been possible to check up on the monthly or weekly sample 
results obtained by the sampling process and to adjust them for 
any bias that is disclosed by such a check. As each new set of 
manufacturing census data became available, which was about 
two years from the time it was taken, such indexes could be 
adjusted to the census data for errors that had accumulated since 
the last census. In the meantime, the results of the sample were 
relied upon as the best available information; and at the same 
time subjective sampling procedures were continually studied 
with a view to improvement wherever possible. To this end, 
the adjustment procedure often discloses areas in which the 
sampling results are especially in need of improvement. 

The method of making such adjustment will be illustrated, 
not so much as a valuable statistical device in itself, but as an 
example to the student of the care and attention to form, pro- 
cedure, cross checking, and the like, required of a good statis- 
tician. Thus the following instructions, together with the form 
used, are presented as an exhibit to help the student visualize 
how a statistician plans his work and works his plan. 

Method of Adjustment Illustrated. A good example of an index 
adjusted to United States census bench marks is the monthly 
index of pay rolls and employment published by the Bureau of 
Labor Statistics. The method of adjustment is reproduced by 
permission of the Bureau of Labor Statistics and is applied to a 
monthly index of pay rolls in the metal stamping, enameling, 
and japanning and lacquering industry of New Jersey.¹ The 
adjustment is carried out on Form BLS 1238, June, 1940, pre- 
sented here in Table 78. 

The raw data, which have been adjusted for the 1937 and 
earlier census figures or bench marks, but remain to be adjusted 
for 1939 census data, are entered by months in columns (3), 
(9), and (13); the sums and averages (S and I) for each of these 
columns are then entered. In column (17), using the lower part 
entitled “Formula if L is not available,”² enter the United 
States census figures for 1937 and 1939 ($Z_1$ and $Z_3$). Calculate the 
ratio $Z_3/Z_1$, and enter it in the space provided; in Table 78 this 
ratio is 0.933280. 

1 The work on New Jersey data was done by a Work Progress Administra- 
tion project sponsored by the New Jersey State Labor Department for the 
construction of monthly indexes of pay rolls and employment in manufac- 
turing industries, January, 1923-December, 1940. One of the authors was 
called upon to serve as consultant and director of the project. 


2 The part labeled “Formula if L is available” is used with a blanket 
adjustment method involving several census periods. 

Copy $S_1$ from column (3) into the space provided 
in column (17); in Table 78 this is 883.82. This number mul- 
tiplied by the ratio $Z_3/Z_1$ equals $S_3'$, which is entered in the space 
provided; in Table 78 it is 824.85. 

In column (13) calculate $R_3$ by finding $\sum nI$ for the year 1939 
[the n for each month is found in column (12)]: $\sum nI$ = January 
nI + February nI + March nI + ... + December nI, includ- 
ing an n for each of the 12 months. 

In column (18) enter $S_3'$, copying it from the last row of 
column (17). Enter $S_3$, copying it from column (13). Subtract 
$S_3$ from $S_3'$, and enter the difference in the next row of column (18). 
Copy $R_3$ [from column (13)] in the next row of column (18). 
The figure in the preceding row, $S_3' - S_3$, divided by $R_3$ gives the 
value of d; that is, $d = (S_3' - S_3)/R_3$. Enter d in the last row of 
column (18). This is the adjustment parameter. It is now used to 
adjust the series by months as follows: 

In columns (4), (10), and (14) enter 1 + nd for each month. 
These values should be obtained on a calculating machine as 
follows: Put 1.000000 in the machine, and add it. Put d on the 
keyboard, being careful to place it correctly for the decimal 
point. Subtract once, and record the answer for 1 + nd in 
January, 1937. Subtract twice more (making −3 altogether), 
and record the answer for 1 + nd in February, 1937. Subtract 
twice more (making −5 altogether), and record the answer for 
1 + nd in March, 1937. Subtract once more (making −6 
altogether), and record the answer for 1 + nd in April, 1937. 
The values for 1 + nd in May, June, July, and August are the 
same as 1 + nd for April, March, February, and January, 
respectively. They can be found by reversing the above process 
on the calculator until 1.000000 remains in the machine and d 
on the keyboard. For September add d once and record 1 + nd 
for that month. Add d four more times (making 5 altogether), 
and record 1 + nd for October. By following a similar pro- 
cedure, guided by n in columns (2), (8), and (12), values of 
1 + nd are calculated and entered for each month through 
December, 1939. 

Enter in columns (5), (11), and (15) the indexes in columns 
(3), (9), and (13), multiplied, respectively, by the 1 + nd for the 
corresponding month in columns (4), (10), and (14). Add for 
each year, and enter the sums, which equal K, $S_2'$, and $S_3'$. Divide 
each of these by 12, and enter the quotients in the next row. 
Thus, in Table 78, K = 883.88, $S_2'$ = 654.19, and $S_3'$ = 824.84; 
and divided by 12 these become 73.66, 54.52, and 68.74. 
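
The first adjustment described above can be sketched in Python as follows. This is a minimal sketch, not the Bureau's form: only the 1937 sum (883.82) and the census ratio (0.933280) are taken from the text, while the 1939 monthly indexes, the 1939 values of n, and the function names are hypothetical, since columns (12) and (13) of Table 78 are not reproduced here. The sketch shows that multiplying each month by 1 + nd, with d equal to the required change in the yearly sum divided by the sum of nI, brings the 1939 sum to the figure implied by the census.

```python
def adjustment_parameter(target_sum, indexes, n_values):
    """Return d = (target_sum - S) / R, with S = sum(I) and R = sum(n * I)."""
    s = sum(indexes)
    r = sum(n * i for n, i in zip(n_values, indexes))
    return (target_sum - s) / r


def apply_first_adjustment(indexes, n_values, d):
    """Multiply each monthly index by its factor 1 + n*d."""
    return [i * (1 + n * d) for n, i in zip(n_values, indexes)]


# Figures quoted in the text for Table 78.
s1 = 883.82                      # sum of the 1937 monthly indexes
census_ratio = 0.933280          # ratio of the 1939 to the 1937 census figure
s3_target = s1 * census_ratio    # census-implied 1939 sum, about 824.85

# HYPOTHETICAL 1939 monthly indexes and n values (assumed for illustration).
indexes_1939 = [66.0, 67.0, 68.0, 69.0, 70.0, 70.5,
                69.5, 69.0, 70.0, 71.0, 72.0, 72.5]
n_1939 = list(range(13, 25))

d = adjustment_parameter(s3_target, indexes_1939, n_1939)
adjusted_1939 = apply_first_adjustment(indexes_1939, n_1939, d)

print(f"target sum   = {s3_target:.2f}")
print(f"adjusted sum = {sum(adjusted_1939):.2f}")
```

The adjusted sum agrees with the target because the sum of I(1 + nd) equals the sum of I plus d times the sum of nI, which is how d was defined.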

In column (16) enter $S_1$ and K, as indicated in the table; 
subtract K from $S_1$, and enter the difference. Divide this dif- 
ference by 40 (the sum of the monthly multipliers m used below), 
and enter the quotient in the next row. This 
figure is h, the parameter for the second adjustment. If $S_1 - K$ 
is smaller than 0.05, regardless of sign, do not calculate h. In 
the problem illustrated, $S_1 - K$ = −0.06; hence h is calculated 
and found to be −0.002. 

If h is calculated, enter in column (6), for each month, mh; 
that is, in January enter h, in February enter 2h, in March enter 
3h, in April enter 4h, in May to August, inclusive, 5h; thereafter 
declining each month, with 2h in November and h in December. 
Enter in column (7) the sum of the figures for the respective 
months in columns (5) and (6). The sum of column (7) is equal 
to $S_1$. If h is not used, the sum of column (5) is taken as $S_1$; that 
is, if h is ignored, K = $S_1$. 
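
The second adjustment can be sketched in the same way. The monthly multipliers (1, 2, 3, 4, then 5 for May to August, then 4, 3, 2, 1) and the 0.05 rule are taken from the instructions above; the function name, the flat column (5) values, and the target sums are hypothetical and serve only to show that spreading the difference over the multipliers restores the original yearly sum.

```python
# Monthly multipliers m for January through December; they sum to 40.
M_PATTERN = [1, 2, 3, 4, 5, 5, 5, 5, 4, 3, 2, 1]


def second_adjustment(column5, s1, tolerance=0.05):
    """Return (column 7, h); h is None when the difference is below tolerance."""
    k = sum(column5)
    difference = s1 - k
    if abs(difference) < tolerance:
        return list(column5), None          # h is not calculated
    h = difference / sum(M_PATTERN)         # divide by 40
    column7 = [v + m * h for v, m in zip(column5, M_PATTERN)]
    return column7, h


# HYPOTHETICAL column (5): twelve equal months summing to K = 876.00,
# with an assumed original sum of 875.94, so that the difference is -0.06.
column5 = [73.0] * 12
column7, h = second_adjustment(column5, s1=875.94)
print(f"h = {h:.4f}, sum of column (7) = {sum(column7):.2f}")

# A difference smaller than 0.05 leaves the column unchanged.
unchanged, no_h = second_adjustment(column5, s1=876.02)
```

With a difference of −0.06 the quotient is −0.0015, which corresponds to the −0.002 recorded in the text when carried to three decimal places.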


CHAPTER XX 
RATIONAL BASIS OF THE ANALYSIS OF TIME SERIES 


Elements of Variation in Time Series. The elements of 
variation contained in an ordinary time series may be illustrated 
by building up a hypothetical time series. 

The first element in the time series is long-time growth, or 
trend. People living in the twentieth century are accustomed 
to the idea that things grow, or progress. Table 79, column (1), 
shows years and months for 3 years, and column (2) shows a set 
of figures that grow at the constant difference of 0.2 per month. 
This column of figures is plotted in Fig. 135 (AA’) and is a 
picture of the growth, or trend, in the hypothetical time series. 

Time series are also likely to have seasonal variations. Many 
economic and social phenomena vary from season to season in a 
similar manner each year. This is most evident in the case of 
activities affected by weather, such as agricultural production; 
but such patterns of seasonal variation occur in other events as 
well. Suppose the seasonal variation in the hypothetical time 
series is such that November is usually 58 per cent above the 
average month, July is usually only 43 per cent as large as the 
average month, etc., as indicated in Table 80 showing the index 
of seasonal variation for the hypothetical time series. 

[Fig. 135. Two of the component parts of a hypothetical time 
series: AA’, the assumed trend; BB’, the trend modified by 
seasonal variation.] 

Figure 136 is a graph of this seasonal variation as it occurs year 
after year, 1943, 1944, and 1945. 

TABLE 79. HYPOTHETICAL TIME SERIES BUILT UP 

       (1)       |   (2)    |    (3)    |  (4)  |   (5)
 Year and month  |  Growth  | Seasonal  |  The  | The cycle
                 | or trend | variation | cycle |  put in
                 |          |  put in   |       |
-----------------+----------+-----------+-------+----------
1943:            |          |           |       |
  January        |      1.0 |      1.45 |   100 |     1.45
  February       |      1.2 |      1.49 |   101 |     1.50
  March          |      1.4 |      1.41 |   102 |     1.44
  April          |      1.6 |      1.33 |   103 |     1.37
  May            |      1.8 |      1.19 |   106 |     1.26
  June           |      2.0 |      1.04 |   109 |     1.13
  July           |      2.2 |      0.95 |   112 |     1.06
  August         |      2.4 |      1.01 |   115 |     1.16
  September      |      2.6 |      2.42 |   120 |     2.90
  October        |      2.8 |      4.00 |   122 |     4.88
  November       |      3.0 |      4.74 |   124 |     5.88
  December       |      3.2 |      5.02 |   125 |     6.28
1944:            |          |           |       |
  January        |      3.4 |      4.93 |   126 |     6.21
  February       |      3.6 |      4.46 |   130 |     5.80
  March          |      3.8 |      3.84 |   140 |     5.38
  April          |      4.0 |      3.32 |   150 |     4.98
  May            |      4.2 |      2.77 |   160 |     4.43
  June           |      4.4 |      2.29 |   180 |     4.12
  July           |      4.6 |      1.98 |   200 |     3.96
  August         |      4.8 |      2.02 |   210 |     4.24
  September      |      5.0 |      4.65 |   160 |     7.44
  October        |      5.2 |      7.44 |   140 |    10.42
  November       |      5.4 |      8.53 |   112 |     9.55
  December       |      5.6 |      8.79 |   111 |     9.76
1945:            |          |           |       |
  January        |      5.8 |      8.41 |   109 |     9.17
  February       |      6.0 |      7.44 |   107 |     7.96
  March          |      6.2 |      6.26 |   105 |     6.57
  April          |      6.4 |      5.31 |   103 |     5.47
  May            |      6.6 |      4.36 |   101 |     4.40
  June           |      6.8 |      3.54 |    99 |     3.50
  July           |      7.0 |      3.01 |    90 |     2.71
  August         |      7.2 |      3.02 |    85 |     2.57
  September      |      7.4 |      6.88 |    75 |     5.16
  October        |      7.6 |     10.87 |    65 |     7.07
  November       |      7.8 |     12.32 |    60 |     7.39
  December       |      8.0 |     12.56 |    55 |     6.91

TABLE 80. SEASONAL VARIATION 
(In percentages of the average month) 

January.....  145 | May.......   66 | September...   93
February....  124 | June......   52 | October.....  143
March.......  101 | July......   43 | November....  158
April.......   83 | August....   42 | December....  157


When the seasonal variation and trend are combined, a line 
like BB’ in Fig. 135 is produced; the data are shown in column 
(3) of Table 79. To obtain each monthly value for the line 
BB’, each monthly coordinate of the line AA’, that is, the growth 
element in the time series, has been multiplied by the index of 
seasonal variation for the corresponding month. This has the 
effect of redistributing the total of the 12 monthly figures of the 
growth line in such a manner as to make them properly reflect 
the seasonal element. Thus the trend figure for January is 
multiplied by 1.45 (or 145 per cent) while the April trend figure 
is multiplied by 0.83 (or 83 per cent). 

[Fig. 136. Seasonal variation in the hypothetical time series.] 

A third element of variation in time series is cyclical fluctuation, 
which may extend over several years. For example, Fig. 137 
shows the rising and the falling movement of a cyclical fluctuation 
by months that occurs over a period of 3 years; this is shown also 
in column (4) of Table 79. In column (5) of Table 79 and 
in Fig. 138 is shown the effect of combining also the cyclical 
movement. The figures for the respective months are now 
altered according to whether the cycle is carrying them upward 
or downward, and the percentage figures for the cycle, shown in 
column (4), depict this upward and downward swing of the cycle. 
The cycle is put into the data by multiplying each monthly 
figure in column (3) by the corresponding monthly index of the 
cycle found in the same row of column (4). The results are 
shown in Fig. 138, which is the final hypothetical time series; 
the data for it are in column (5) of Table 79. 

[Fig. 137. The cycle in the hypothetical time series. See Table 79, 
column (4).] 

[Fig. 138. All three component elements of the hypothetical time 
series combined. See Table 79, column (5).] 

Two important effects of combining the growth element and 
the seasonal-variation element are noticeable from Fig. 135. 
In the first place, the combination has a tendency to obscure the 
trend. It is still clear in line BB’ that there is a rising tendency, 
but the wide sweeps of the seasonal fluctuations tend to conceal 
the exact nature of the rise; for without the line AA’ in Fig. 135 it 
would be difficult to visualize precisely what the slope of this 
trend actually is. In the second place, the combination definitely 
distorts the shape of the seasonal variation, in two ways: (1) It 
causes the valleys and peaks to be thrown out of line arith- 
metically. (2) It minimizes the size of the seasonal variation 
where trend is low and exaggerates the size of the seasonal 
variation where trend is high. 
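
The building up of the hypothetical series can be retraced with a short Python sketch. The seasonal index is that of Table 80 and the cycle percentages are those of column (4) of Table 79 for 1943; the function name `build_series` is an assumption made for illustration. Each month's trend figure is multiplied by the seasonal index and then by the cycle index, both taken as percentages.

```python
# Index of seasonal variation, January through December (Table 80).
SEASONAL = [145, 124, 101, 83, 66, 52, 43, 42, 93, 143, 158, 157]

# Cycle percentages for the 12 months of 1943 (Table 79, column 4).
CYCLE_1943 = [100, 101, 102, 103, 106, 109, 112, 115, 120, 122, 124, 125]


def build_series(cycle, start=1.0, step=0.2, seasonal=SEASONAL):
    """Return (trend, trend x seasonal, trend x seasonal x cycle) per month."""
    rows = []
    for k, cycle_pct in enumerate(cycle):
        trend = start + step * k                          # column (2)
        with_seasonal = trend * seasonal[k % 12] / 100.0  # column (3)
        final = with_seasonal * cycle_pct / 100.0         # column (5)
        rows.append((trend, with_seasonal, final))
    return rows


rows_1943 = build_series(CYCLE_1943)
for trend, with_seasonal, final in rows_1943:
    print(f"{trend:4.1f}  {with_seasonal:5.2f}  {final:5.2f}")
```

Carried to two decimals, the printed figures reproduce the 1943 rows of columns (2), (3), and (5) of Table 79. Extending the `cycle` list to 36 months continues the series through 1945.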

From Fig. 138 it is clear that the effect of including the 
cyclical movement is further to obscure the trend or growth 
element and to distort still more the character of the seasonal 
variation. It is in approximately this condition that most time 
series exist in their raw state. Raw data of time series contain 
in varying degrees elements of all three of these types of fluctua- 
tion. Some have little seasonal variation, some have a great 
deal, and some have none. Many have rising trends following 
population and general growth, while a few have declining 
trends because they represent decaying or disappearing types 
of economic or social activities. In practically all time series, 
cycles of varying length and varying amplitude occur. 

In addition to the three elements illustrated by the hypo- 
thetical case, most time series contain fluctuations due to unusual 
or residual occurrences, such as the effects of floods, storms, or 
strikes. This gives four elements or types of fluctuation and 
these four types of fluctuation serve as a good classification for an 
empirical start in the analysis of time series.¹ 


GENESIS AND PURPOSES OF THE TIME-SERIES ANALYSIS 


The hypothetical problem just illustrated consisted in a 
synthesis. The study of time series is analysis—a reversal of the 
procedure that has just been demonstrated. This breaking up 
of time series into its constituent elements, and the various com- 
plications involved, constitutes the subject of time-series analysis. 

Why do economists, social scientists, and statisticians analyze 
time series? What started them along this line of procedure, 
and what are its advantages? The answers to these or other 
questions as to the significance of time-series analysis have in 
general a threefold basis: (1) interest in the population problem 
and the discovery of the law of organic growth, (2) concern for 
the general problem of the so-called “business cycle," and (3) 
preoccupation with the variety of problems associated with 
seasonal influences upon business and social life. 

Rational Trends. Historical Background. In 1798, Thomas 
Robert Malthus, a minister of the gospel and a political econo- 
mist, wrote an Essay on the Principle of Population, in which he 
advanced the fundamental principle that the law of growth of 
population is geometric; population, he said, tends to grow in a 
geometric progression. 

1 This is the conventional classification of types of fluctuation that occur 
in time series; it was presented in detail by W. M. Persons of the Harvard 
Committee of Economic Research and published in the Review of Economic 
Statistics, Preliminary Vol. 1. See also articles by the same author in the 
American Economic Review, Vol. 6 (1916), pp. 738-769, and Publications 
of the American Statistical Association, Vol. 12 (1917), pp. 602-642. 

The curve representing population 
growth would accordingly be a positive exponential curve, 
similar in character to the curve representing the growth of a 
principal sum of money at compound interest. 
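
The contrast between growth in a geometric progression (the compound-interest curve) and growth in an arithmetical progression can be made concrete with a small Python sketch. The function names, the starting figure, and the rates are hypothetical; they illustrate the two kinds of progression and are not estimates of any actual population.

```python
def geometric_growth(initial, rate, periods):
    """Values growing by a constant ratio (1 + rate) each period."""
    return [initial * (1 + rate) ** t for t in range(periods + 1)]


def arithmetic_growth(initial, increment, periods):
    """Values growing by a constant difference each period."""
    return [initial + increment * t for t in range(periods + 1)]


# HYPOTHETICAL figures: 100 units growing 50 per cent per period,
# against 100 units growing by 50 units per period.
print(geometric_growth(100.0, 0.5, 4))
print(arithmetic_growth(100.0, 50.0, 4))
```

The two series agree after one period and diverge thereafter, the geometric series adding ever larger amounts while the arithmetical series adds the same amount each period.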

While some of the doctrines of Malthus regarding the controls 
to population growth are no longer accepted as tenable, the 
fundamental principle of the tendency of population to grow 
geometrically has not only been accepted with regard to popula- 
tion theory but has been widely applied in other fields. To 
people of the twentieth century this principle seems almost 
axiomatic, for they are familiar with the history of the nineteenth 
century, when the statistics show such a growth of population 
and such a development of many kinds of activities according to 
this principle of geometric progression. 

The principle of growth was not so obvious to those living at 
the time of Malthus, nor to those living in the middle of the 
nineteenth century. Consequently, it was startling and new to 
see the same principle applied to growth in certain economic and 
social phenomena, as was done by William Stanley Jevons, an 
English economist, in his celebrated book on The Coal Question 
(1865). Chapter IX of that book is entitled Of the Natural Law 
of Social Growth. In this he propounds the idea that many of 
the phenomena of economic and social life follow the same law 
of organic growth as population. In some, the progressive rate 
of geometric growth is greater than that of population; in some, 
less; but in all the growth is geometric. In another chapter of 
the same book, Jevons applied this principle and tested it with 
reference to England’s progress in industry. His contribution 
was of the nature of the proposal of a hypothesis that served as a 
challenge to mathematically minded economists like himself 
and others and soon stimulated the development of ideas as to 
how best to write the equation for the curve that would repre- 
sent growth of population. By such an equation, it was thought, 
population could be forecast far into the future as well as for 
intercensus years. 

Population Curves. In 1891, H. S. Pritchett suggested that 
an equation of the form $P = a + bt + ct^2 + dt^3$ would fit the 
curve of population growth. The subject of equations for the 
population curve became one of wide concern to population 
students, economists, and scientists in general, as well as of prac- 
tical interest in obtaining accurate estimates of population 
between the dates of taking the census. In 1907, Raymond 
Pearl proposed that the form of this equation should be 


$$P = a + bt + ct^2 + d \log t.$$ * 


The problem was again approached by G. Udny Yule, an English 
statistician, in 1925; and in later years the discussion was 
continued.¹ 

Perhaps the most striking contribution on the subject is that 
of Raymond Pearl and Lowell J. Reed, who in 1920 advanced the 
idea that the population curve should not continue to rise 
indefinitely but should level off after some period of time and 
that thus the population curve showing the law of growth would 
not follow the compound-interest curve indefinitely. Rather, it 
would resemble the curve shown in Fig. 145 in Chap. XXI. 

The mathematical characteristics of this curve and its equa- 
tions are presented by these joint authors in the Journal of the 
Royal Statistical Society for 1927.² As will be clear from a glance 
at Fig. 145, the shape of this curve indicates that, under the law 
of organic growth, a first period 
of relatively slow arithmetical growth is followed by a period 
of very rapid arithmetical growth but that finally this rapid 
arithmetical growth slows down, so that the 
curve at the top assumes an asymptotic character. 
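
The three phases just described (slow growth, rapid growth, and slowing toward a limit) can be seen numerically in a short Python sketch. The form used here, P(t) = L / (1 + e^(a − bt)), is one common way of writing a logistic curve and is an assumption of this sketch; the notation and the constants are hypothetical and are not those of Pearl and Reed.

```python
import math


def logistic(t, limit, a, b):
    """One common logistic form: P(t) = limit / (1 + exp(a - b*t)), b > 0."""
    return limit / (1.0 + math.exp(a - b * t))


# HYPOTHETICAL constants: upper limit 100, midpoint of the rise at t = 5.
values = [logistic(t, limit=100.0, a=5.0, b=1.0) for t in range(11)]

# Arithmetical growth per period: small at first, largest near the
# middle of the curve, and small again as the limit is approached.
increments = [later - earlier for earlier, later in zip(values, values[1:])]

for t, inc in enumerate(increments):
    print(f"period {t} to {t + 1}: growth of {inc:5.2f}")
```

The values rise throughout but never reach the limit, which is the asymptotic character referred to above.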

Early Population Theories. Quételet remarked that Malthus’s 
doctrine resolved itself essentially to the proposition that, under 
the most favorable industrial circumstances, population could 
grow no more rapidly than in an arithmetical progression, 
although, of course, he stated the geometric law of growth as a 
tendency. 


* KNIBBS, GEORGE H., “The Laws of Growth of Population,” Journal of 
the American Statistical Association, Vol. 21 (1926), p. 381. 

1 Journal of the Royal Statistical Society, Vol. 88 (1925), pp. 1-62, which 
contains an excellent historical summary of the problem of curve fitting to 
population growth. 

2 REED, L. J., and RAYMOND PEARL, “On the Summation of the Logistic 
Curve,” Journal of the Royal Statistical Society, Vol. 90 (1927), pp. 729-746. 
The mathematics of the curve was discovered, say the authors, by Verhulst, 
according to Quételet writing in 1838, and was again applied to population 
by Pearl and Reed in 1920. Cf. PEARL, RAYMOND, Studies in Human 
Biology (1924), Chap. XXIV, The Curve of Population Growth. 

1 Op. cit. 

It could grow only in arithmetical progression 
because it would be kept down to that rate by the fact that sub- 
sistence grows only in arithmetical progression. He also pointed 
out that the theory of population growth up to the time he was 
writing (1836) had not been developed to the point where it 
could be considered “dans le domaine des sciences mathé- 
matiques, auquel elle semble spécialement devoir appartenir.”¹ 
Even so, Quételet himself never went to the point of developing a 
mathematical equation expressing the law of population growth, 
although in other ways his contributions as a population theorist 
are outstanding. However, he did reach the point of suggesting 
that the law of population growth is like that of a body traveling 
through a resisting medium that tends to attain a limiting 
velocity. 

Yule suggested that this analogy probably inspired Verhulst, 
professor of mathematics at the École Militaire, to a controversy 
with Quételet on the subject. The problem of devising a 
mathematical law of population growth was actively studied by 
Verhulst for a number of years. He fitted logistic curves to the 
population histories of several countries for as many years as 
data were available, but the limited amount of data did not 
inspire confidence in the results.² This work of Verhulst seems 
to have been forgotten until the time of the Pearl-Reed studies 
of 1920. Pearl and Reed’s discovery of the law of population 
growth in the mathematical form developed by them was 
independent. As Yule says, they seemed to have been unaware 
of the formulation by Verhulst. 

Basis for Rationalizing Trends. The attempt is made by 
students of the law of population growth to rationalize the 
fitting of such a logistic curve to experienced growth of popula- 
tion in many parts of the world, and at different times, by basing 
their reasoning upon the following points: 


1 Sur l'homme (Bruxelles, Louis Hauman et Comp. 1836), pp. 283, 287. 

2 Notice sur la loi que la population suit dans son accroissement (correspond- 
ance mathématique et physique publiée par A. Quételet, 1838), tome 10 
(also numbered tome 2 of the third series), pp. 113-121; and by the same 
author, “Recherches mathématiques sur la loi d’accroissement de la population,”
Nouveaux mémoires de l'Académie Royale des Sciences et Belles-Lettres
de Bruxelles, tome 17 (1845), pp. 1-38; “Deuxième mémoire sur la loi 
d'accroissement de la population,” ibid., tome 20 (1847), pp. 1-32. Citations 
from G. Udny Yule, op. cit., p. 57. 
1. The construction of such a curve through the plotted points 
showing actual population growth in a large number of places 
produces a good fit. 

2. Biological experiment under controlled conditions, with 
other species than man, produces increases in numbers in a 
manner following such a curve. Thus Pearl made such an 
experiment with fruit flies under controlled conditions.¹

3. Studies of trends in birth rates and death rates, in their 
relation to population growth, appear to fit into the theory that 
the law of population growth follows this curve. 

4. Studies of death rates by age distribution of the population 
and the relationship between age composition and total death 
rate and birth rate of a population appear to fit into the law of 
population thus formulated.²

5. While it is true that the parabolas of earlier writers fit 
empirically the population growth wherever tried, such a curve 
fit cannot be rationalized, because the extension of the parabola 
goes on to infinity. On the other hand, the logistic curves of the 
Verhulst, Pearl-Reed, or Gompertz variety approach a limit in an 
asymptotic manner, which seems to be a more rational manner in 
which to view the law of population growth. 

6. The asymptotic limit that it is assumed population is 
approaching can be closely approximated by study of the circum- 
stances surrounding the determination of the factors influencing 
population growth. 

Thus, it is recognized in this theory of the law of population 
growth that should technological changes comparable with the 
industrial revolution occur, the asymptotic limit might have to 

1 Cf. PEARL, R., The Biology of Death, pp. 253-254. Cited in Yule,
op. cit., p. 22. 

2 These ideas have reached the general public as well as the scientific
group, through such articles as Robert A. Kuczynski, “The World's Future
Population,” The New Republic, May 7, 1930; Aaron Hardy Ulm, “Our
Falling Birth Rate Is Studied by Experts,” The New York Times, Mar. 2,
1930; Louis I. Dublin (Statistician of the Metropolitan Life Insurance
Company), “America Approaching Stabilized Population,” The New York
Times, Mar. 4, 1930; and by the same author, “Our Aging Population: Its
Vital Effects,” The New York Times, Jan. 4, 1931. Cf. also DUBLIN,
LOUIS I., and ALFRED J. LOTKA, “On the True Rate of Natural Increase,”
Journal of the American Statistical Association, Vol. 20 (1925), pp. 305-339;
and DUBLIN, LOUIS I., “The Statistician and the Population Problem,”
ibid., Vol. 20 (1925), pp. 1-12.
be raised and that the law of population growth over a period of 
centuries may be conceivably a series of ogive-like cycles. 
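
The contrast drawn in point 5 above, between a parabola that runs on to infinity and a logistic curve that approaches a limit, may be illustrated numerically. The following is only a sketch under stated assumptions: the parameters K, r, and t0 and the "observed" values are hypothetical, chosen for illustration, and the simple Verhulst form K / (1 + e^(-r(t - t0))) is assumed rather than any particular equation fitted by the writers cited.

```python
import numpy as np

def logistic(t, K, r, t0):
    """Verhulst-type logistic curve with upper asymptote K (assumed form)."""
    return K / (1.0 + np.exp(-r * (t - t0)))

# Hypothetical parameters, chosen only for illustration.
K, r, t0 = 200.0, 0.03, 1900.0

# "Observed" period: synthetic values taken from the logistic itself.
t_obs = np.arange(1800, 1941, 10, dtype=float)
y_obs = logistic(t_obs, K, r, t0)

# Fit a second-degree parabola to the same points by least squares
# (time is measured from the middle of the period for numerical stability).
t_mid = t_obs.mean()
coeffs = np.polyfit(t_obs - t_mid, y_obs, deg=2)

# Extend both curves well beyond the observed period.
t_far = np.array([2000.0, 2100.0, 2300.0])
y_logistic = logistic(t_far, K, r, t0)
y_parabola = np.polyval(coeffs, t_far - t_mid)

for t, yl, yp in zip(t_far, y_logistic, y_parabola):
    print(f"{t:.0f}: logistic {yl:8.1f}   parabola {yp:8.1f}")
```

Within the observed period the two curves may lie close together; beyond it the logistic values remain below K, whereas the parabola continues to rise without limit.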

Criticism of Rationalized Trends. However, this rationalistic 
view of curve fitting to population and the attainment in this 
manner of a mathematical law of population growth have not 
gone unchallenged. Prof. A. L. Bowley, an outstanding English 
statistician, says, “I regret that so much prominence has been 
given to the logistic equation. It certainly has the merit, and 
the danger, of mathematical neatness, and it expresses what may 
be regarded as a fundamental law of population—that is, that 
population cannot increase indefinitely in constant geometric 
progression. There is, however, no reason a priori to suppose 
that the damping down of the increase is of so regular or uniform 
a nature that a mathematical function of the same form repre- 
sents it in all times and in all places, and none a priori to justify 
the use of a linear term (out of all possible functions) for this 
purpose. We should rather anticipate that the form of the 
function would be neither general nor linear. The justification 
for the logistic form is purely empirical, and, in fact, we are asked 
to accept it because it does give results which agree with the 
records of certain populations. Any other curve which gives as 
good an agreement has similar claims for representing the series 
of records. The closeness of the agreement is, I think, unduly 
accented by the very small vertical scales used by Dr. Pearl and 
Mr. Yule.”¹

T. H. C. Stevenson, another English statistician, rather 
prosaically declares that he finds sufficient explanation, without 
resort to logistic curves, for the rapid decline in birth rates since 
the end of the nineteenth century, in the dissemination of knowledge
of contraception.²

More recently, the whole question of the rationality of curve 
fitting was taken up in an admirably thorough manner by 
George H. Knibbs, whose findings are apparently that the 
mechanical process of the curve fitting is empirical and must be 
accepted as empirical but that the law of population growth 
may be conveniently expressed by such equations when it is 

1 From remarks on Yule’s paper, op. cit., p. 76. 


2 “The Laws Governing Population,” Journal of the Royal Statistical 
Society, Vol. 88 (1925), pp. 63-76. 
thoroughly understood how those equations apply, and also 
their limitations.¹

It is natural to scientists to be skeptical, particularly of other 
scientists’ startling discoveries, and the student of social science 
must get used to such controversies and pick and choose for 
himself what he believes to represent progressive development 
of human knowledge and what merely overzealous creative 
imagination. It is in these attempts to explain phenomena that 
the progressive development of human knowledge occurs. 

Application of Rational Trends to Social Philosophy. It was 
pointed out above that Jevons had advanced the hypothesis 
that the law of organic growth applies also to social and economic 
phenomena. Following the example of the population curve- 
fitting group, scientific curiosity turned to the discovery of a 
rational conception of curve fitting to social, biological, and 
economic phenomena in order to replace purely empirical 
methods. As Wesley C. Mitchell has pointed out,² “A step 
toward such a conception is represented by the frequent inter- 
pretation of certain trend lines as showing the ‘growth factor.’ 
Statisticians dwell with satisfaction upon their demonstrations 
that certain industries have expanded decade after decade at a 
substantially uniform rate, or at a rate which has changed in 
some uniform way. They take almost as much pleasure in contemplating
the somewhat similar rates at which different industries
have grown in given periods and countries. Nor are they 
at a loss for explanation of these uniformities. In view of the 
increase in population characteristic of the great commercial 
nations and of the advance in industrial technique, it seems 
scarcely fanciful to think of modern society as ‘tending’ to 
produce an ever larger supply of goods for the satisfaction of its 
wants. On this basis, cyclical fluctuations appear as alternating 
accelerations and retardations in the pace of a more fundamental 

1 “Laws of Growth of Population,” Journal of the American Statistical 
Association, Vol. 21 (1926), pp. 381-398; and Vol. 22 (1927), pp. 49-59. 

2 Business Cycles—The Problem Stated and Its Setting (1928), pp. 221-224; cf.
PRESCOTT, RAYMOND B., “Law of Growth in Forecasting Demand,” (1928),
Journal of the American Statistical Association, Vol. 17 (1922), pp. 471-479. 
Later, Leroy E. Peabody fitted such a curve to railway traffic in the United 
States, “Growth Curves and Railway Traffic,” Journal of the American 
Statistical Association, Vol. 19 (1924), pp. 476-483. 
process. Secular trends, in short, are taken to measure economic 
progress generation by generation. 

“A bold speculation of this sort has been ventured by Raymond 
B. Prescott. He suggests that perhaps ‘all industries, whose 
growth depends directly or indirectly upon the ability of the 
people to consume their products,’ pass through similar phases 
in the course of their development. Four stages seem to be 
common. 

1. Period of experimentation. 

2. Period of growth into the social fabric. 

3. Through the point where the growth increases, but at a 
diminishing rate. 

4. Period of stability. 

“On this basis, Prescott suggests that the secular trends of all 
such industries may be represented by a single type of curve— 
that yielded by the Gompertz equation. Every country may 
have a different rate of growth, and so may every industry, 
because no two industries have the same combination of in- 
fluences. They will trace the same type of curve, however, 
even though the rate of growth is different." 

More recently, an ambitious and carefully studied attempt to 
rationalize the whole subject of trends in economic phenomena 
was made by Simon S. Kuznets, of the National Bureau of 
Economic Research.¹ Kuznets analyzes the various factors 
making for growth, and also making for slowing up of growth, 
under the following items: 

1. On the side of growth: 

Population increase. 
Changes in demand. 
Technological changes. 

2. On the slowing up of growth: 

Slackening of technological progress. 
Retarding influence of other slower industries. 
Funds available for expansion decrease in relative size as industry grows. 
Competition of later developing industries in other countries. 


1 Secular Movements in Production and Prices (1930); in 1943, Kuznets's 
work is still the best statistical study of this type. For more recent trend 
studies of a different type, see Edwin Frickey, Economic Fluctuations in the 
United States, 1866-1914 (1942), and Norman J. Silberling, The Dynamics 
of Business (1942). These studies use subjective methods for analyzing 
trends and cycles. 
Kuznets fits logistic curves to a large number of production 
series and also fits appropriate curves to the corresponding price 
series. It should be noted that this type of rationalization does 
not apply to price series, and as a rule the curves that Kuznets 
fits to his price series were merely parabolas and represented 
empirical trends. One of the most interesting results of his 
work is his discovery and analysis of “secondary trends.” 
Fig. 139.—Production of Portland cement in the United States with logistic 
trend line, 1880-1924. 


Thus, from a large variety of data, he took out the long-time 
growth, upon the assumption of the existence of a logistic growth 
element, and he found, not only cycles, but also longer wavelike 
movements of 11 to 20 years. This is illustrated by Figs. 139 to 
141, reproduced from his book and showing the type of analysis 
as applied to cement production and prices, 1880-1924.¹ As 
seen in Fig. 139, the heavy line represents the logistic curve, and 
there are long sweeps of the actual data in waves above and below 


1 Op. cit., pp. 100-101, reproduced by permission of the author. 
this growth curve, as well as cyclical movements of shorter 
duration. Figure 140 shows a parabola fitted to the course of 
Fig. 140.—[Prices of Portland cement in the United States] and primary trend, 1880-1924. 


prices of cement during the same period. Figure 141 shows the 
long wavelike movements in production and in prices, with the 
relative fluctuations of the actual data above and below these 
Fig. 141.—Production of and prices of Portland cement in the United States, 
1880-1924. Secondary trends and minor cycles. 


secondary trends. Kuznets calls the logistic growth curve the 
"primary trend line," and the heavy, black, wavelike line in 
Fig. 141 represents the secondary trends of the production of 
cement. The actual data fluctuate above and below these 
secondary trends in major and minor cycles. 

Before the publication of Kuznets’s work these longer move- 
ments had been studied by C. A. R. Wardell. Wardell called 
the movements “major cycles” rather than secondary trends, 
and his method of analysis was quite different from that followed 
by Kuznets. He also attempted to give an explanation of the 
major business cycle that Kuznets rejects.² In 1927, also, there 
appeared in Russian a discussion of the whole problem of major 
cycles, which contains a report by Kondratieff and a counter- 
reply by D. T. Oparin. To explain these major cycles Kon- 
dratieff developed the theory that they are essentially cycles 
of expansion and contraction in the growth of the basic capital 
equipment of a country. 

Thus, starting with the desire to define the law of population 
growth more precisely and to bring the population problem into 
the realm of mathematical treatment, scholars have carried on 
by analogy into other fields; so far as economics is concerned, 
the principal result so far is the discovery of these long wavelike 
movements. Not only do the theoretical economists need to 
explain the old-fashioned business cycle (which was always a 
rather vague concept), but they now are challenged to explain 
(1) secondary secular movements or major business cycles, (2) 
ordinary business cycles, and (3) minor business cycles. The 
analysis of time series, then, must include some additional types 
of fluctuations from those described in a preliminary manner at 
the beginning of the chapter. 

The following classification of movements is now suggested.³

1. Trend, or long-time growth, which appears to be logistic in 
character and for which a mathematical formula may be rational. 

2. Cyclical movements of three types, for which a rational 
mathematical formula is not appropriate. 

a. Secondary secular movements or major cycles. 
b. Cycles (the old theoretical business cycle). 
c. Minor cycles. 

1 An Investigation of Economic Data for Major Cycles, Thesis (University 
of Pennsylvania, 1927). 

2 Op. cit., pp. 265-266. 

3 Cf. classification suggested by Prof. Willford I. King, which is similar, 
in “Principles Underlying the Isolation of Cycles and Trends,” Journal of 
the American Statistical Association, Vol. 19 (1924), p. 468. 
3. Seasonal variations, for which a mathematical formula is not 
rational. 

4. Irregular fluctuations, such as those due to wars, epidemics, 
floods, or strikes. These are called “residual fluctuations" and 
may follow the normal curve.¹

Empirical Trends. Trend analysis, that is to say, the application
of mathematical processes in order to obtain equations 
describing direction of movement of a time series, may be 
applied, not only for rational ends indicated in the discussion of 
the law of organic growth, but also empirically where no a priori 
knowledge about the character of growth or law of movement or 
trend exists. Indeed, the search for such a law may have no 
bearing on the analysis; the trend may be sought for the purpose 
of isolating and studying cyclical movements. When trends are 
found without seeking to verify some hypothesis concerning a 
law of growth but merely with respect to given data, they are 
empirical trends. 

Application of Empirical Trends to Cycle Analysis. The 
third factor mentioned at the beginning of this chapter as a 
force stimulating statistical analysis of time series has been the 
abstract study of the business cycle. Such abstract analysis 
has challenged the mathematical economist and the statistician 
to discover and to apply methods of statistical analysis that 
would measure the cycle. 

Economic history of the modern era has been one of alternating
periods of relative prosperity and relative depression 
and has also been characterized by periods of more or less violent 
speculative activity. The Mississippi Scheme and the South 
Sea Bubble burst in France and England in 1720, and there 
occurred commercial crises of major importance in 1763, 1772, 
1783, and 1793. During the eighteenth century these recurring 
periods of crisis excited much discussion, but eighteenth-century 
writings dealt mainly with the dramatic surface events and did 
not develop a theory explaining them. And indeed the fundamental
principles of economics were not formulated until the 
latter half of the eighteenth century. The publication of Adam 
Smith’s Wealth of Nations in 1776 is usually taken as the debut 
of economics as a science. 


1 See pp. 285-297, and 570, 648. 
While a group of economists following Adam Smith developed 
a theoretical explanation of the operation of economic forces 
under normal conditions, or in the long run, another group that 
assumed the role of critics of the "economists" developed 
theories of the business cycle. These were such men as Sismondi, 
Rodbertus, and Karl Marx. J. C. L. Simonde de Sismondi, an 
Italian Swiss, had originally been a thorough convert of Adam 
Smith and laissez faire and had become the Continental expositor 
of his theories; but as he said, writing in 1818 and referring to 
the depression of 1815-1817, he was deeply affected by the com- 
mercial crisis that Europe had experienced and by the cruel 
sufferings of the industrial workers that he had witnessed in 
Italy, Switzerland, and France and that all reports showed to 
have been at least as severe in England, in Germany, and in 
Belgium.¹ He set about developing a theory to explain the 
recurrence of such periods, and in his work are found many of 
the ideas current even today concerning the origin and explana- 
tion of the business cycle. He suggested that the business cycle 
is due to the faulty organization of the capitalist system and that 
the system is planless and therefore needs planning. He also 
suggested the explanation that what is needed is a better dis- 
tribution of income. He suggested the oversaving hypothesis. 
His principal explanation is the inequality in the distribution of 
incomes resulting in glutting of the markets and the production 
of crises and depressions. 

The idea that commercial crises are cyclical in character 
evolved early in the nineteenth century; some even went so far 
as to advance the theory that they occur every 7 or every 11 years. 
In 1875, this led the economist and statistician, W. S. Jevons, to 
propound a theory that the business cycle is due to cycles that 
occur in sun spots, which it had been discovered have a rhythm 
of about 11 years. During the latter half of the nineteenth century
a number of statistical attempts to discover the business 
cycle were made. The attempts used the idea of smoothing 
out the irregular fluctuations in the curves of raw data and 


1 MITCHELL, op. cit., pp. 4-5. The historical material here given on the 
business cycle is taken principally from this source. 

2 For a more complete discussion of the history of business-cycle theory 
than it is possible to give here, see ibid. and also Ernst Wagemann, Economic 
Rhythm, (1930), either of which contains further bibliographical references. 




thereby clarifying the cyclical movements. The earliest 
examples of such statistical work appear to be in 1884.¹

Both Jevons and the later experimenters of the nineteenth 
century were content with attempts to discover cyclical move- 
ments in separate individual series. In 1909, Beveridge in 
England; in 1911, Julin in France; and in 1913, Mortara in 
Italy conceived the idea of combining a number of series into a 
composite statistical measure of the business cycle. The work 
of carrying out this task was then largely taken over by the 
American statisticians, in the construction of the so-called 
“barometers” of business conditions that have been described 
in Chap. XIX, Index Numbers. The period up to about 1914 
may be characterized as one during which interest in the subject 
of the business cycle was intense. Economists were in sharp 
controversy with the business-cycle theorists—denying emphat- 
ically the implications that they drew from their analysis of the 
statistics available and from their theoretical explanations of the 
business cycle. At the same time, the disturbing theories of 
the business-cycle students had greater claim to general interest
because they touched upon a more vital and present thing than 
was customarily dealt with by the conventional economist. 
The conventional economist was explaining how things tend to 
happen under normal conditions, and the business-cycle theorists 
loudly proclaimed that we never live under “normal conditions” 
and that the theories of the economist were therefore useless. 
At the same time, the interest of the practical businessman was 
aroused by the desire for knowledge of the relationship between 
his own particular business and the general business cycle. 

Development of Technique for Time-series Analysis. The 
pressure to develop a statistical technique to analyze the problem
was thus very great, and the accumulation of available 
statistical material to analyze had been rapid for a number of 
years. The technique that developed assumed two general 
characteristics, one of which has since been extensively used, 
the other less frequently. 

The first method of technique that developed was the discovery 
statistically of the cycle in time series by the removal of the 


1 POYNTING, J. H., and R. H. HOOKER, “A Comparison of the Fluctuations
in the Price of Wheat and in the Cotton and Silk Imports into Great 
Britain,” Journal of the Royal Statistical Society, Vol. 47 (1884), pp. 34-64. 
trend from a series of annual data. Trends were fitted empirically
to the data by the method of least squares or some other 
method, most commonly by the method of least squares, using 
relatively short periods of time, say 9 to 19 years. The cyclical 
movements then were the measures of the movements of the 
data above and below the empirical trend. Prof. Willford I. 
King said, “Any particular type of fluctuation in which we 
happen to be interested can be successfully studied only when 
most of the other kinds of fluctuations have been eliminated.”¹
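
The removal of an empirical trend in order to lay bare the cyclical movements may be sketched as follows. The series is synthetic and hypothetical (a straight-line growth with a 4-year wave superimposed), the 13-year period is merely one choice within the 9 to 19 years mentioned above, and the library routine numpy.polyfit stands in for the hand computation of the least-squares line.

```python
import numpy as np

# Synthetic annual series (hypothetical): linear growth plus a 4-year wave.
years = np.arange(1920, 1933)            # 13 years, an odd number
t = years - years[len(years) // 2]       # time measured from the middle year
series = 100.0 + 2.5 * t + 6.0 * np.sin(2 * np.pi * t / 4.0)

# Straight-line trend by least squares; polyfit returns (slope, intercept).
b, a = np.polyfit(t, series, deg=1)
trend = a + b * t

# Cyclical movements measured as deviations of the data from the trend,
# in original units and as percentages of trend.
deviation = series - trend
percent_of_trend = 100.0 * series / trend

for yr, d, p in zip(years, deviation, percent_of_trend):
    print(f"{yr}: deviation {d:7.2f}   percent of trend {p:7.1f}")
```

Because the line is fitted by least squares with a constant term, the deviations above and below the trend sum to zero over the period fitted.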

This is, of course, the raison d'être for the empirical trend 
analysis, which is primarily for the purpose of isolating the ordinary
and the minor cycles. The major cycles or secondary 
secular movements are best studied by the Kuznets methods that 
have been described and illustrated. The methods of analysis 
used are essentially similar to those employed in empirical trend 
analysis, but the Kuznets logistic trend lines may be rationalized 
in terms of a law of organic growth. 

The second method of technique that developed was the 
attempt to apply harmonic analysis or the periodogram to series 
of economic data, a different application of the method of least 
squares. This was the work of Henry L. Moore of Columbia 
University in his application of Fourier’s theorem, the mathe- 
matics of which Fourier had developed a century ago in his 
Théorie des mouvements de la chaleur dans les corps solides and 
for which he was feted by the Académie des Sciences in 1812. 

Prof. Moore applied the mathematics of the periodogram to the 
records of rainfall in the corn belt of the United States, working 
out the periodogram equations for the cycles of rainfall; he 
discovered similar cycles in crops and introduced the harmonic 
analysis into modern statistical method. He says:² “The principal
contribution of this essay is the discovery of the law and 
cause of economic cycles. The rhythm in the activity of economic
life, the alternation of buoyant, purposeful expansion 
with aimless depression, is caused by the rhythm in the yield 
per acre of the crops; while the rhythm in the production of the 


1 Journal of the American Statistical Association, Vol. 19 (1924), p. 468. 

2 Economic Cycles: Their Law and Cause (1914). Cf. CRUM, W. L.,
“Periodogram Analysis,” Chap. XI in H. L. Rietz, Handbook of Mathematical
Statistics (1924). Also BRUNT, DAVID, The Combination of Observations
(1931), Chaps. XI and XII. 
crops is, in turn, caused by the rhythm of changing weather, 
which is represented by the cyclical changes in the amount of 
rainfall. The law of the cycles of rainfall is the law of the cycles 
of the crops and the law of economic cycles.” 

The mathematics of the harmonic analysis are somewhat com- 
plex, and this method has not attained the popularity that 
has been attached to the removal of empirical trend by using 
straight lines or second- or third-degree polynomials, where the 
mathematical analysis involved is quite simple. 
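
What the periodogram does may be indicated by a short sketch. It is not Moore's own computation: the discrete Fourier transform (numpy.fft.rfft) is used here as a modern stand-in, and the series is hypothetical, made up of an 8-year cycle, a weaker 16-year cycle, and random disturbances.

```python
import numpy as np

n = 64
t = np.arange(n)

# Hypothetical annual series: 8-year and 16-year cycles plus random noise.
rng = np.random.default_rng(0)
y = (10.0 * np.sin(2 * np.pi * t / 8.0)
     + 4.0 * np.cos(2 * np.pi * t / 16.0)
     + rng.normal(0.0, 1.0, n))

# Periodogram ordinates: squared amplitude at each trial frequency.
y_centered = y - y.mean()
spectrum = np.fft.rfft(y_centered)
freqs = np.fft.rfftfreq(n, d=1.0)        # cycles per year
power = np.abs(spectrum) ** 2 / n

# The largest ordinate (zero frequency excluded) marks the dominant cycle.
k = int(np.argmax(power[1:])) + 1
dominant_period = 1.0 / freqs[k]
print("dominant period in years:", dominant_period)
```

A pronounced peak in the periodogram suggests a periodic component of that length; whether such a period has any rational meaning is a separate question, as the controversy over Moore's rainfall cycles indicates.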

Use of Functions of Arc Tangent and Orthogonal Polynomials 
in Trend Analysis. In recent years two modified forms of the 
conventional trend analysis by the method of least squares have 
been developed. In 1928, it was suggested that the inverse 
trigonometric function, or arc tangent, could be adapted to 
measuring trends in series behaving in the following manner:! 

1. A downward tendency approximating a straight line but 
of such nature that projection of a straight line into the future 
would lead to absurd results, that is, negative or ridiculously 
small positive values when comparatively large positive values 
only are possible. 

2. Approximately a linear growth or decline, followed by an 
abrupt change in level (rise or drop) and subsequent resumption 
of the early tendency. 

3. Approximately a straight-line trend interrupted by a sharp 
rise or drop, followed by another abrupt change in level and 
subsequent resumption of the early movement at the same or a 
different level. 

The method was used successfully in fitting a trend to the 
annual prices of International Paper common stock for the period 
1900-1926 and to the annual index of wholesale prices in the 
United States, 1900-1928. 

The orthogonal analysis is a method invented for reducing the 
amount of arithmetical calculation involved in fitting poly- 
nomials to time series by the method of least squares, especially 
second- and third-degree polynomials or polynomials of higher 
degree. The fitting of a polynomial of higher than second degree 
to a time series involves laborious calculations, particularly if a 
considerable period of time is covered. This laborsaving method 


1 CARMICHAEL, F. L., “The Arc Tangent in Trend Determination,” 
Journal of the American Statistical Association, Vol. 23 (1928), pp. 253-262. 
is described in detail, together with tables of values to facilitate 
its use, in Chap. XXII.¹


1 See pp. 600-615. Also cf. JORDAN, CHARLES, “Approximation and 
Graduation According to the Principle of Least Squares by Orthogonal 
Polynomials,” The Annals of Mathematical Statistics, Vol. 3 (1932), pp. 257-357.
Cf. ROMANOVSKY, V., “Note on Orthogonalizing Series of Functions 
and Interpolation,” Biometrika, Vol. 19 (1927), pp. 93-99; JORDAN, CHARLES, 
“Sur une série de polynomes dont chaque somme partielle représente la 
meilleure approximation d’un degré donné suivant la méthode des moindres 
carrés,” Proceedings of the London Mathematical Society, Vol. 20 (1921), pp. 
297-325; and DIEULEFAIT, CARLOS E., “La determinación de la tendencia 
secular en las series económicas,” Gabinete de Estadística, Rosario, Argentine 
Republic (Santa Fe), Universidad Nacional del Litoral (1932), pp. 1-52. Cf. 
FISHER, R. A., Statistical Methods for Research Workers (4th ed., 1932), 
pp. 133-142.
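
The saving of labor offered by orthogonal polynomials may be seen in a brief sketch. It does not reproduce the tabular procedure of Chap. XXII; the function name orthogonal_basis and the series are hypothetical, and the polynomials are built by the ordinary Gram-Schmidt process on 1, t, t², t³ over equally spaced years. The point shown is that each coefficient is a simple ratio of sums and is not disturbed when a term of higher degree is added.

```python
import numpy as np

def orthogonal_basis(t, degree):
    """Polynomials in t of degree 0..degree, made mutually orthogonal over
    the given time points by Gram-Schmidt (hypothetical helper)."""
    basis = []
    for k in range(degree + 1):
        p = t.astype(float) ** k
        for q in basis:
            p = p - (p @ q) / (q @ q) * q
        basis.append(p)
    return np.column_stack(basis)

t = np.arange(-7, 8)                                  # 15 years, middle year = 0
y = 50.0 + 3.0 * t + 0.4 * t**2 - 0.02 * t**3         # hypothetical series

# Third-degree fit: each coefficient is (sum of y*p) / (sum of p*p).
P = orthogonal_basis(t, 3)
coef = (P.T @ y) / np.sum(P**2, axis=0)
fitted = P @ coef

# Second-degree fit: its coefficients are the first three found above.
P2 = orthogonal_basis(t, 2)
coef_quadratic = (P2.T @ y) / np.sum(P2**2, axis=0)

# Check against the conventional least-squares polynomial.
direct = np.polyval(np.polyfit(t, y, 3), t)
print(np.allclose(fitted, direct), np.allclose(coef_quadratic, coef[:3]))
```

With ordinary powers of t, by contrast, raising the degree of the polynomial requires the whole set of normal equations to be solved anew.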


CHAPTER XXI 
TREND ANALYSIS 


Empirical Trend vs. Rational Trend. Both empirical and 
rational trends are obtained by analysis from raw data; the 
difference between the two is that a rational trend can be 
explained in terms of long-time growth or decline, whereas an 
empirical trend has no meaning per se. The empirical trend is 
a useful tool of analysis, as will be seen in the ensuing discussion. 
In the preceding chapter the attempt was made to convey the 
idea that a rational trend is one that is found for its own sake; 
it has a rational explanation and is useful as a method of interpretation
in itself. While it may be true that the rationalization
that is made with respect to such trends is preliminary or 
even tentative, nevertheless the original intent is to make a 
rational use of them. Empirical trends are those for which there 
is admittedly no rational basis at the start, being used merely as 
a convenient method of removing from the data longer time 
movements that obscure the shorter time cyclical fluctuations. 
Empirical trends in themselves may have no rational sig- 
nificance as a description of any type of long-time growth, or 
movement. An empirical trend calculated for a period of 9 years 
at a point in time coincident with the peak of a secondary secular 
movement would presumably be in the form of a parabola. At 
another point in time, a 9-year trend analysis may give a straight 
line, or a logarithmic line. If a trend line happened to be cal- 
culated for a period of time from the low point of a secondary 
secular movement to the high point of another, the empirical 
trend might assume the form of a Verhulst growth curve; but 
it may have no such significance as a law of growth in that case, 
being simply the result of happening to take an empirical trend 
for that period of time. An examination of the heavy black 
curve representing the secondary trends in cement production 
(Fig. 141, page 556) will help to make clear what is meant by these 
statements. 
Detecting Cycle by Removing Empirical Trend. While empirical 
trends may have no rational significance per se, the fitting of 
an empirical trend to the annual data of a time series will make 
it possible to isolate the residuals from the trend. These 
residuals constitute the cycles and minor cycles of the period 
analyzed. The first clear statement of the analysis of time 
series by this method was made by W. M. Persons in 1915.¹
The method is illustrated by examples at the end of this chapter. 

Thus the function of empirical trend analysis is to obtain an 
approximation to some longer term movement for the purpose of 
eliminating this in order to study shorter term movements of a 
cyclical or accidental character. The empirical trend may 
approximate a segment of a long-term cyclical movement, or it 
may approximate a portion of long-term growth in the series 
that might itself have rational explanation. What the empirical 
trend measures depends upon the circumstances in each problem, 
and the discovery of the rational nature of an empirical trend 
depends upon a priori knowledge. 

Methods of Fitting Trend. Three methods of fitting trend to 
time series can be distinguished: (1) the method of least squares, 
(2) the method of selected points, and (3) the method of 
averages. 

Fitting a Trend Line by the Method of Least Squares. Figure 142 
represents a plane in which there are seven points. 
To simplify the arithmetic an uneven number of points is taken, 
and the middle point is selected for the location of the y-axis.² 
Accordingly, t varies from −3 to +3. The coordinates of the 
points, as may be observed from the figure, are as follows: 


$$P_1(t = -3,\ y_1)\quad P_2(t = -2,\ y_2)\quad P_3(t = -1,\ y_3)$$
$$P_4(t = 0,\ y_4)\quad P_5(t = 1,\ y_5)\quad P_6(t = 2,\ y_6)\quad P_7(t = 3,\ y_7)$$


1 American Economic Review, December, 1916, pp. 739-769; Publications 
of the American Statistical Association, June, 1917, pp. 602-642; Harvard 
Review of Economic Statistics, Preliminary Vol. 1 (1919). Cf. MITCHELL, 
W. C., Business Cycles—The Problem Stated and Its Setting, pp. 200, 212-213, 
328-330. 

2 For statistical purposes it is more convenient to take a more recent year 
as the time origin than that of the birth of Christ. Thus, if a given set of 
data run from 1927, say, to 1937, it might be convenient to choose 1932 as 
the zero year. If this were done, then 1933 would be t = 1, 1935 would 
be t = 3, 1929 would be t = -3, etc. 


566 STUDY OF DYNAMIC VARIABILITY 


The corresponding points on the straight line to be found, for 
example, point A in the figure, may be represented by the 
following coordinates: 

$$(t=-3,\ y'_1) \quad (t=-2,\ y'_2) \quad (t=-1,\ y'_3)$$
$$(t=0,\ y'_4) \quad (t=1,\ y'_5) \quad (t=2,\ y'_6) \quad (t=3,\ y'_7)$$

The general form of the equation for a straight line in a field of 
coordinates y and t is y = a + bt, and for this line the equation 
is as follows: 

$$y' = a + bt \qquad (1)$$


The line is determined for the particular case by finding values 
of a and b. 


The line that is sought is the one from which the sum of the 
squared deviations of the points from the line is less than such a 
sum with respect to any other line. This is the least-squares 
criterion. 

The vertical residuals of particular points from the line are 
as follows, as illustrated in Fig. 142 for P5: 

$$r_1 = y_1 - y'_1$$
$$r_2 = y_2 - y'_2$$
$$\cdots$$
$$r_5 = y_5 - y'_5 \quad \text{(illustrated by } P_5 \text{ in Fig. 142)}$$
$$\cdots$$
$$r_7 = y_7 - y'_7$$

Some of these residuals (designated as r) are negative, while 
others are positive. When squared, however, they are all positive, 
and the condition that must be satisfied according to the 
least-squares criterion for a line that will best fit these points 
is that Σr² = minimum, in other words, that 

$$\sum (y - y')^2 = \text{minimum} \qquad (2)$$

The value of y', from Eq. (1), may be substituted; Eq. (2) then 
becomes 

$$\sum (y - a - bt)^2 = \text{minimum} \qquad (3)$$

The condition under which Eq. (3) is true is that the total 
differential is equal to zero, in other words, that 


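$$d\left(\sum r^2\right) = \frac{\partial\left(\sum r^2\right)}{\partial a}\,da + \frac{\partial\left(\sum r^2\right)}{\partial b}\,db = 0$$

Inasmuch as da and db cannot be equal to zero, this gives the two 
conditions that¹ 

$$\frac{\partial\left(\sum r^2\right)}{\partial a} = \frac{\partial}{\partial a}\sum (y - a - bt)^2 = -2\sum (y - a - bt) = 0$$
$$\frac{\partial\left(\sum r^2\right)}{\partial b} = \frac{\partial}{\partial b}\sum (y - a - bt)^2 = -2\sum t(y - a - bt) = 0 \qquad (4)$$

and hence the following two equations, by canceling out the -2's 
and carrying out the summations: 

$$\sum y = Na + b\sum t \qquad \text{(i)}$$
$$\sum ty = a\sum t + b\sum t^2 \qquad \text{(ii)}$$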

In these two equations, all the terms are known except a and 
b, because Σt = 0 and Σy is the sum of the known y's of the 
seven points P1, ..., P7. The Σt² is 

$$\sum t^2 = (-3)^2 + (-2)^2 + (-1)^2 + 0^2 + 1^2 + 2^2 + 3^2 = 28$$

Because Σt = 0, values for a and b can be found as follows: 

$$a = \frac{\sum y}{N} \qquad \text{from Eq. (i)}$$
$$b = \frac{\sum ty}{\sum t^2} \qquad \text{from Eq. (ii)}$$


1 In the case under consideration, it is not necessary to be concerned with 
the possibility that these same conditions might also hold true for a maxi- 
mum or a minimum, since the conditions of the problem indicate that it 
is a minimum. 
Accordingly, the equation for the line of best fit, by the 
criterion of least squares, is as follows: 

$$y' = \frac{\sum y}{N} + \frac{\sum ty}{\sum t^2}\,t \qquad (5)$$


Numerical Illustration. As a more concrete illustration, 
values will be assigned to the y's of the seven points, as follows 
(t coordinates remaining as before): 

P1(y = 5)   P2(y = 2)   P3(y = 7)    P4(y = 4) 
P5(y = 6)   P6(y = 10)  P7(y = 8) 

An orderly work sheet will be set up in order to find Σy, Σty, 
and Σt². N, of course, is equal to 7. 

WORK SHEET FOR FINDING BEST-FITTING STRAIGHT LINE FOR SEVEN 
GIVEN POINTS 

   t        y        ty       t² 
  -3        5       -15        9 
  -2        2        -4        4 
  -1        7        -7        1 
   0        4         0        0 
   1        6         6        1 
   2       10        20        4 
   3        8        24        9 
                    +50 
                    -26 
 Σt = 0   Σy = 42   Σty = 24   Σt² = 28 

The equation for the best-fitting line according to the 
least-squares criterion is therefore as follows [see Eq. (5)]: 

$$y' = \frac{42}{7} + \frac{24}{28}\,t$$
or 
$$y' = 6 + 0.86t$$
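
The arithmetic of this work sheet may be sketched in a few lines of Python. The sketch is offered only as an illustration; the function name `fit_centered_line` is hypothetical and not part of the text, and it assumes, as the text does, an odd number of equally spaced observations with the origin at the middle point.

```python
def fit_centered_line(y):
    """Return (a, b) of y' = a + b*t, with t = 0 at the middle observation."""
    n_obs = len(y)
    if n_obs % 2 == 0:
        raise ValueError("an odd number of observations is assumed")
    half = n_obs // 2
    t = range(-half, half + 1)                 # -3, ..., +3 for seven points
    sum_y = sum(y)
    sum_ty = sum(ti * yi for ti, yi in zip(t, y))
    sum_t2 = sum(ti * ti for ti in t)
    a = sum_y / n_obs                          # Eq. (i), since the sum of t is zero
    b = sum_ty / sum_t2                        # Eq. (ii), since the sum of t is zero
    return a, b


seven_points = [5, 2, 7, 4, 6, 10, 8]          # y values of P1, ..., P7
a, b = fit_centered_line(seven_points)
print(f"y' = {a:.2f} + {b:.2f}t")              # y' = 6.00 + 0.86t
```

Because the origin is placed at the middle point, each parameter comes from a single division, which is the economy the text describes.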


It will be well to note what the equation says. First, with 
each unit increase of t, the line (that is, the value of y') rises by 
0.86. This value, 0.86, is called the “slope” of the line; and it 
is the tangent of the angle that the line makes with the t-axis 
or with any line parallel to the t-axis. Second, it says that, 
when t = 0, y' = 6. This means that the line passes through 
the y-axis at a point +6 from the origin (when the y-axis is 
located at the middle point in time). 

If the y-axis were shifted from its present location to the 
position t = -3, everything else remaining in its original 
position, the value of the t coordinates of all the points P would 
change to accord with the new location of the y-axis. Also, it is 
to be noted that the above equation would then become 

y' = [6 - 3(0.86)] + 0.86t 
or 
y' = 3.42 + 0.86t 

since 3.42 will be the intercept on the new y-axis. 


Fig. 143. 

Fitting Second- or Third-degree Curves. Second-, third-, or 
even higher-degree curves may similarly be fitted by the method 
of least squares.¹ It may happen that the points are distributed 
in such a manner that a straight line does not fit. For example, 
Fig. 143 shows seven points that would be better fitted by a 
parabola. The general form of the equation for such a curve is 

$$y = a + bt + ct^2$$

The equations for finding values of a, b, and c, for such a best- 
fitting parabola, are worked out on precisely the same principles 
as those for finding a and b for the best-fitting straight line. 
That is to say, the equation y' = a + bt + ct² is fitted to the 
points so that 

$$\sum (y - y')^2 = \text{minimum} \qquad (6)$$

and when the value of y' is substituted in this equation, it 
becomes 

$$\sum (y - a - bt - ct^2)^2 = \text{minimum} \qquad (7)$$

1 For a better method of fitting polynomials by the method of least 
squares, see Chap. XXII, Orthogonal Polynomial Trends. 


When this expression is differentiated with respect to a, b, 
and c, following the same method as in Eqs. (4), (i), and (ii), 
the equations for finding a, b, and c are obtained, as follows: 

$$\sum y = Na + b\sum t + c\sum t^2 \qquad \text{(i)}$$
$$\sum ty = a\sum t + b\sum t^2 + c\sum t^3 \qquad \text{(ii)}$$
$$\sum t^2 y = a\sum t^2 + b\sum t^3 + c\sum t^4 \qquad \text{(iii)}$$

A work sheet such as the following form (leaving out columns for 
the odd powers of t; their sums will all be zero since 
the zero value of t is selected in the middle of an odd number of 
years) is used for finding values of a, b, and c. 

WORK SHEET FOR FINDING BEST-FITTING PARABOLA FOR SEVEN GIVEN 
POINTS 

Since Σt = 0 and Σt³ = 0, when the sums of the columns in the work 
sheet are substituted in Eqs. (i), (ii), and (iii) above, the three 
unknowns a, b, and c may be found by solution of these. 
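
The solution of Eqs. (i), (ii), and (iii) with the origin at the middle point may be sketched as follows. This is an illustration only: the function name `fit_centered_parabola` is hypothetical, and the seven y values are borrowed from the straight-line illustration above merely to have figures to work with; the text itself does not fit a parabola to them.

```python
def fit_centered_parabola(y):
    """Return (a, b, c) of y' = a + b*t + c*t**2, with t = 0 at the middle."""
    n_obs = len(y)
    if n_obs % 2 == 0:
        raise ValueError("an odd number of observations is assumed")
    half = n_obs // 2
    t = range(-half, half + 1)
    sum_y = sum(y)
    sum_ty = sum(ti * yi for ti, yi in zip(t, y))
    sum_t2y = sum(ti * ti * yi for ti, yi in zip(t, y))
    sum_t2 = sum(ti ** 2 for ti in t)
    sum_t4 = sum(ti ** 4 for ti in t)
    # With the sums of odd powers of t equal to zero, Eq. (ii) alone gives b.
    b = sum_ty / sum_t2
    # Eqs. (i) and (iii) are two equations in a and c, solved by determinants.
    det = n_obs * sum_t4 - sum_t2 ** 2
    a = (sum_y * sum_t4 - sum_t2 * sum_t2y) / det
    c = (n_obs * sum_t2y - sum_t2 * sum_y) / det
    return a, b, c


a, b, c = fit_centered_parabola([5, 2, 7, 4, 6, 10, 8])
print(round(a, 3), round(b, 3), round(c, 3))   # 5.524 0.857 0.119
```

The slope b is the same as for the straight line, since Eq. (ii) is unchanged when the odd-power sums vanish.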

Probability Theory Is Not Applied. It must be remembered that 
the application of the least-squares criterion for obtaining the 
line that best fits a time series does not involve the application 
of the theory of least squares in the sense that the trend line 
obtained is a most probable line, expressive of a law of movement 
or growth in the probability sense.¹ As originally applied, 
the theory of least squares had a definite connection with the 
theory of probabilities because it was devised as a method of 
obtaining a measure of the most probable orbit of a comet, etc. 
In the fitting of a trend line to a single time series there is no 
multiplicity of cases fluctuating in a normal distribution about the 
trend line. The use of the least-squares criterion in trend fitting 
for time series is merely the application by analogy of a method 
that produces desired results; it gives an objective criterion for 
finding the line of best fit. If the analyst can be satisfied with 
a less objective method, he may use, for example, the method of 
selected points, which will now be described. 

1 Cf. KUZNETS, SIMON S., Secular Movements in Production and Prices 
(1930), p. 62, who cites W. H. R. Lexis, Zur Theorie der Massenerscheinungen in 
der menschlichen Gesellschaft (Freiburg i.B., F. Wagner, Ed., 1877), pp. 31-33. 
See also TINTNER, GERHARD, “The Analysis of Economic Time Series,” 
Journal of the American Statistical Association, Vol. 35 (1940), pp. 93-100. 

Method of Selected Points. One of the simplest methods 
of determining the trend of a time series is to make the trend 
“line” pass through certain points selected as representative of 
normal values. This line¹ may be drawn in a purely freehand 
fashion, or a mathematical equation may be determined such 
that it is satisfied by the coordinates of the selected points. 

To determine a unique mathematical equation in a given 
case the number of selected points must be taken equal to the 
number of parameters in the equation. Thus, if a straight-line 
trend seems appropriate, two normal years are selected (preferably 
near the ends of the series) and the values of a and b in 
the equation y' = a + bt are so determined that the equation 
is satisfied by the values of t and y for the selected points. If 
a parabolic trend of the type y' = a + bt + ct² is deemed 
appropriate, then three normal points must be selected to 
determine the values of a, b, and c. In general, if a polynomial 
of the nth degree is taken to portray the course of the trend, 
viz., y' = a + bt + ct² + ... + kt^n, then there must be n + 1 
selected points, one for each parameter. The polynomial is the 
simplest type of mathematical equation to employ for this purpose. Other, more 
“rational” types may also be fitted by this method, however, 
and its use in fitting a simple logistic curve is described below. 

The actual process of finding the mathematical equation of 
the chosen type that is satisfied by the selected points consists 
in solving n simultaneous equations, n being the number of 
selected points (or the number of parameters to be determined). 
Thus if (t1, y1) and (t2, y2) are the coordinates of the selected 
points, the straight line y' = a + bt passing through these 
points is given by the solution of the following equations for 
a and b: 

$$y_1 = a + bt_1$$
$$y_2 = a + bt_2$$

1 “Line” is here used in the generic sense; it may be either straight or 
curved. 
For example, if the time scale is such that t1 = 3 and t2 = 9 
and if the y values for these years (or months) are¹ y1 = 68 and 
y2 = 110, then a and b are found by solving the equations 

68 = a + 3b 
110 = a + 9b 

These yield a = 47 and b = 7; hence the equation for the given 
trend is y' = 47 + 7t. 

If the equation to be fitted is a second-degree parabola 
y' = a + bt + ct² and if (t1, y1), (t2, y2), and (t3, y3) are the 
coordinates of the selected points, then a, b, and c are determined 
by solving the equations 

$$y_1 = a + bt_1 + ct_1^2$$
$$y_2 = a + bt_2 + ct_2^2$$
$$y_3 = a + bt_3 + ct_3^2$$

Three equations are more difficult to solve than two; but if the 
time scale is chosen so that t1 = 0, then these reduce to 

$$y_1 = a$$
$$y_2 = a + bt_2 + ct_2^2$$
$$y_3 = a + bt_3 + ct_3^2$$

or 

$$y_2 - y_1 = bt_2 + ct_2^2$$
$$y_3 - y_1 = bt_3 + ct_3^2$$

and two equations are obtained for determining b and c, the 
value of a being y1. For example, if the selected points are 

(t1 = 0, y1 = 68)   (t2 = 6, y2 = 110)   (t3 = 12, y3 = 200) 

then a = 68 and b and c may be found from the solution of the 
equations 

110 - 68 = 6b + 36c 
200 - 68 = 12b + 144c 

The results are b = 3 and c = 2/3 = 0.67; hence, the parabola 
which passes through the given points is 

y' = 68 + 3t + 0.67t², origin at t1 = 0 
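
Both worked examples of the method of selected points amount to solving as many simultaneous equations as there are points. A minimal sketch follows, using exact fractions and ordinary elimination; the function name `polynomial_through_points` is hypothetical, and the sketch assumes that the selected points have distinct t values.

```python
from fractions import Fraction


def polynomial_through_points(points):
    """Coefficients [a, b, c, ...] of the polynomial through the given (t, y) points."""
    n = len(points)
    # One equation per point: y_i = a + b*t_i + c*t_i**2 + ...
    rows = [[Fraction(t) ** p for p in range(n)] + [Fraction(y)] for t, y in points]
    for col in range(n):
        # Bring a row with a nonzero entry in this column into the pivot position.
        pivot = next(r for r in range(col, n) if rows[r][col] != 0)
        rows[col], rows[pivot] = rows[pivot], rows[col]
        rows[col] = [v / rows[col][col] for v in rows[col]]
        # Eliminate this unknown from every other equation.
        for r in range(n):
            if r != col and rows[r][col] != 0:
                factor = rows[r][col]
                rows[r] = [v - factor * w for v, w in zip(rows[r], rows[col])]
    return [row[-1] for row in rows]


line = polynomial_through_points([(3, 68), (9, 110)])
parabola = polynomial_through_points([(0, 68), (6, 110), (12, 200)])
print(line)       # [Fraction(47, 1), Fraction(7, 1)]
print(parabola)   # [Fraction(68, 1), Fraction(3, 1), Fraction(2, 3)]
```

Exact fractions are used here so that c appears as 2/3 rather than a rounded decimal; ordinary floating-point arithmetic would serve equally well for practical work.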


1 These values may be actual values or values estimated as normal. 

When higher degree polynomials are fitted in this way, the 
simultaneous equations may be solved by repeated substitution, 
or special methods making use of finite differences may be 
employed.¹ 

Method of Averages. Even less refined methods of fitting 
lines to data than those already described could be applied; in 
fact, the analyst could, if he so desired, merely draw the line 
that seems to fit the plotted data. The objection to this method 
is that it is too subjective—no two people would draw the same 
line. A certain degree of objectivity is secured by applying the 
method of selected points, which has already been described, or 
by using a modification of that method, namely, the method of 
averages. The method of averages merely suggests a refine- 
ment in the selection of the points. It can be illustrated by the 
fitting of a straight line, but it could be applied to curves as well. 


WORK SHEET FOR FITTING A STRAIGHT-LINE TREND BY THE METHOD OF 
AVERAGES 

   t       y 
   1       5 
   2       2 
   3       7      For t = 3, y is taken as the average of the first five 
   4       4      y's; that is, 25/5 = 5. Hence t1 = 3, y1 = 5. 
   5       7 
   6       8 
   7      15 
   8      19      For t = 8, y is taken as the average of the last five 
   9      18      y's; that is, 75/5 = 15. Hence t2 = 8, y2 = 15. 
  10      15 


The trend line is the straight line passing through the two 
points t = 3, y' = 5 and t = 8, y' = 15. Following the same 
procedure as that used in the method of selected points, the 
parameters a and b are found by solving the following two 
equations: 

 5 = a + 3b 
15 = a + 8b 

from which it is found that b = 2 and a = -1, so that the 
trend line is y' = -1 + 2t. 
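
The method of averages may be sketched as follows. The function name `method_of_averages` is hypothetical, and the t values 1 to 10 are an assumption read from the work sheet, in which t = 3 and t = 8 are the middle points of the first and last five items.

```python
def method_of_averages(t, y):
    """Straight line through the mean points of the first and last halves of the data."""
    half = len(y) // 2
    t1, y1 = sum(t[:half]) / half, sum(y[:half]) / half      # first selected point
    t2, y2 = sum(t[-half:]) / half, sum(y[-half:]) / half    # second selected point
    b = (y2 - y1) / (t2 - t1)
    a = y1 - b * t1
    return a, b


t = list(range(1, 11))                       # assumed time scale of the work sheet
y = [5, 2, 7, 4, 7, 8, 15, 19, 18, 15]
a, b = method_of_averages(t, y)
print(a, b)                                  # -1.0 2.0
```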


Method of Moving Averages. Ordinarily the method of moving 
averages is used with monthly data, but it could be used with 
annual data if an appropriate number of years over which to 
average or smooth the data could be determined. The difficulty 
of determining the proper number of years for the averaging 
period is one of the objections to this method; another objection 
is that it does not give an equation of trend. The method of 
moving averages is explained in Chap. XXIII, Seasonal Variation. 

1 For the latter, the reader is referred to E. T. Whittaker and G. Robinson, 
The Calculus of Observations (1924), Chap. I. 

Advantages of the Method of Least Squares. The advantage 
of using the least-squares line is that it gives a line from which the 
residuals add up to zero and the sum of their squares is a minimum; 
this supplies an objective criterion for the fit of the line. In addition, 
the least-squares method of trend fitting is a very flexible device 
that can be widely applied and varied according to the type 
of line desired. If a complex trend line is desired, a mathematical 
procedure based upon the least-squares criterion is 
handily available. The method of orthogonal polynomials 
explained in the next chapter, for example, is an application of 
the method of least squares. 


ILLUSTRATIONS OF RATIONAL TRENDS 


As indicated in the preceding chapter, rational trends are 
likely to be logistic in character. The simplest type of logistic 
curve is of the form y = ab^t, which may readily be reduced to a 
straight line if the equation is expressed in logarithms, as follows: 
log y = log a + t log b. 

Trend of a Dying Institution. If the early development, 
growth, and arrival at maturity of a new economic institution 
follow the pattern suggested by Raymond B. Prescott, as 
explained in the preceding chapter, presumably the disappearance 
of a dying institution would follow a reversal of that pattern. 
Thus, it would die slowly at first, then rapidly, and then slowly 
again until it finally disappeared. If such is the case, the 
appropriate equation to use is one of the Verhulst, Pearl-Reed, 
or Gompertz types of curves. However, an economic institution 
that is disappearing from the economic system might depart in 
another manner; it might be struck a sudden devastating blow 
by a new development that caused it to die or decline according 
to the simple logistic curve y = ab^t. Such appears to be the 
case with respect to a certain type of commercial bank credit 
known as “open-market commercial paper.” Many authorities 
on money and credit believe this to be a dying institution 
in this country; and accordingly the downward trend illustrated 
in Table 81 and Fig. 144 may be considered a rational trend.¹ 
The data used are annual average monthly volumes of open- 
market commercial paper outstanding; and Table 81 is a work 
sheet for calculating the straight-line logarithmic trend line for 
these data, following the method indicated on pages 566 to 568. 
Here, however, the straight line is fitted to the logarithms of the 
data instead of to the data themselves. 

The equation for this trend line is y' = ab^t, so that, by the 
rule of logarithms, 

$$\log y' = \log a + t \log b$$

The two least-squares equations that would be obtained by the 
method explained above² are as follows: 

$$\sum \log y = N \log a + \log b \sum t$$
$$\sum t \log y = \log a \sum t + \log b \sum t^2$$

Upon substituting the sums taken from the appropriate columns 
of Table 81, this gives 

 36.18035 = 23 log a 
-38.41844 = 1,012 log b 

from which 

log a = 1.57306 and log b = -0.037963 

Therefore, the equation of the best-fitting (according to the 
least-squares criterion) logarithmic trend in this case is 

$$\log y' = 1.57306 - 0.037963t$$
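
The logarithmic fit may be checked with a short sketch using the raw data of Table 81. Two points are assumptions of the sketch rather than statements of the text: the data are entered in millions of dollars, so that log a comes out as 2.57306, whereas the logarithms tabulated in the work sheet carry a characteristic one lower (they appear to be logarithms of the figures in tens of millions, giving 1.57306); and the helper name `trend` is hypothetical. The slope log b is unaffected, because Σt = 0.

```python
import math

# Open-market commercial paper outstanding, 1919-1941, millions of dollars (Table 81).
paper = [1084, 1113, 749, 768, 834, 873, 743, 629, 585, 494, 322, 489,
         264, 106, 95, 156, 174, 188, 296, 239, 198, 234, 317]

half = len(paper) // 2
t = range(-half, half + 1)                     # origin at 1930
log_y = [math.log10(v) for v in paper]

# Least-squares equations with the sum of t equal to zero.
log_a = sum(log_y) / len(log_y)
log_b = sum(ti * li for ti, li in zip(t, log_y)) / sum(ti * ti for ti in t)


def trend(ti):
    """Trend value in millions of dollars for time ti (years from 1930)."""
    return 10 ** (log_a + ti * log_b)


print(round(log_a, 5), round(log_b, 6), round(trend(0)))
```

The trend value at the origin, about 374 million dollars, agrees with the entry for 1930 in the work sheet.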


1 For explanations of the demise of open-market commercial paper see 
B. H. Beckhart, The New York Money Market, Vol. 3, pp. 242-246; O. A. 
Greef, The Commercial Paper House in the United States (1938), pp. 123-127; 
P. Hunt, Portfolio Policies of Banks in the United States 1920-1929 (1940), 
pp. 11-38. 

2 See pp. 566-567. 

TABLE 81.—WORK SHEET FOR CALCULATING ANNUAL INDEX OF NORMAL 
AND TREND 
Straight-line logarithmic trend 
DATA: Open-market commercial paper outstanding. Annual averages of 
monthly data 
(In millions of dollars) 
Equation of trend: log y' = 1.57306 - 0.037963t 

                                                      Computed            Index of 
        Raw                                           log of              computed 
Year    data     log y       t     t log y      t²    trend,     Trend,   trend, 
        y                                             log y'     y'       y/y' 

1917    ....     ....      -13 
1918    ....     ....      -12 
1919    1,084    2.03503   -11    -22.38533    121    1.99065     979     110.7 
1920    1,113    2.04650   -10    -20.46500    100    1.95269     897     124.1 
1921      749    1.87448    -9    -16.87032     81    1.91473     822      91.1 
1922      768    1.88536    -8    -15.08288     64    1.87676     753     102.0 
1923      834    1.92117    -7    -13.44819     49    1.83880     690     120.9 
1924      873    1.94101    -6    -11.64606     36    1.80084     632     138.1 
1925      743    1.87099    -5     -9.35495     25    1.76288     579     128.3 
1926      629    1.79865    -4     -7.19460     16    1.72491     531     118.4 
1927      585    1.76716    -3     -5.30148      9    1.68695     486     120.4 
1928      494    1.69373    -2     -3.38746      4    1.64899     446     110.8 
1929      322    1.50786    -1     -1.50786      1    1.61102     408      78.9 
1930      489    1.68931     0      0            0    1.57306     374     130.7 
1931      264    1.42160     1      1.42160      1    1.53510     343      77.0 
1932      106    1.02531     2      2.05062      4    1.49713     314      33.8 
1933       95    0.97772     3      2.93316      9    1.45917     288      33.0 
1934      156    1.19312     4      4.77248     16    1.42121     264      59.1 
1935      174    1.24055     5      6.20275     25    1.38324     242      71.9 
1936      188    1.27416     6      7.64496     36    1.34528     222      84.7 
1937      296    1.47129     7     10.29903     49    1.30732     203     145.8 
1938      239    1.37840     8     11.02720     64    1.26936     186     128.5 
1939      198    1.29667     9     11.67003     81    1.23139     170     116.5 
1940      234    1.36922    10     13.69220    100    1.19343     156     150.0 
1941      317    1.50106    11     16.51166    121    1.15547     143     221.7 
1942    ....     ....       12 
1943    ....     ....       13 

N = 23    Σ log y = 36.18035    Σ t log y = -38.41844    Σt² = 1,012    Σ(y/y') = 2,496.4 

Source: Compiled from the Annual Report of the Federal Reserve Board, 1929, p. 121; 
1935, p. 174; and from the Survey of Current Business, Annual Supplement (Vol. 20, 1940), 
p. 47. 

When a logarithmic straight line is fitted to a time series by 
the method of least squares, it is the sum of the squares of the 
ratio residuals that is made a minimum, and not the sum of the 
squares of the actual residuals as is the case where an 
arithmetical straight line is fitted.¹ It is the following expression 
that is minimized: 

$$\sum (\log y - \log y')^2 = \sum \left[\log\left(\frac{y}{y'}\right)\right]^2$$

If the logarithm is expanded in a power series, this sum is seen 
to be roughly equivalent, apart from a constant factor, to 

$$\sum \left(\frac{y - y'}{y'}\right)^2$$

which is the same as 

$$\sum \left(\frac{y}{y'} - 1\right)^2$$

Fig. 144.—Open-market commercial paper outstanding in the United States, 
1919-1941. Logistic trend fitted by method of least squares. 

For a dying institution, open-market commercial paper out- 
standing showed remarkable vigor in the years 1933-1941, and 
perhaps the monetary economists were premature in their 
predictions. Whether or not they were is a matter for the 
future to reveal. 


1 Cf. pp. 566-567. 
Trend of a Growing Institution. Method of Selected Points 
Illustrated. If the hypothesis made by Raymond B. Prescott 
can be demonstrated or illustrated in real life, it should surely be 
done by the development of the automobile during the past 
three or four decades. Table 82 and Fig. 145 give an illustration 
of the fitting of a rational trend that purports to represent this 
type of growth, constituting thereby a test of this hypothesis.¹ 
They also illustrate the method of fitting a logistic curve of the 
Pearl-Reed type by using selected points. 

The equation of the curve may be written in the form 

$$y' = \frac{k}{1 + m} \qquad (8)$$

in which m = e^(a + bt). 

It is thus required to find three parameters k, a, and b, which 
is more conveniently done by first converting the equation into 
logarithms, as indicated in the work sheet. 

By using annual data, consisting of monthly average output 
of passenger cars and trucks each year from 1903 to 1941, a 
graph was made and from its examination the following selected 
points were adopted: 

  1909          1922          1935 
  t0 = 0        t1 = 13       t2 = 26 
  y0 = 10       y1 = 250      y2 = 320 

The values of the parameters k, a, and b may be found by using 
the following equations:² 

$$k = \frac{2y_0y_1y_2 - y_1^2(y_0 + y_2)}{y_0y_2 - y_1^2}$$
$$a = \log_e \frac{k - y_0}{y_0} \qquad (9)$$
$$b = \frac{1}{n}\log_e \frac{y_0(k - y_1)}{y_1(k - y_0)}$$

in which n is defined as t1 - t0. 

1 Explained on pp. 553-555. 
2 Cf. PEARL, RAYMOND, Studies in Human Biology (1924), Chap. XXIV; 
The Biology of Population Growth (1925), p. 22. Citations used from 
F. E. Croxton and D. J. Cowden, Applied General Statistics (1939), pp. 444-445, 
852-853. 
Thus, for the problem illustrated, 

$$k = \frac{2 \times 10 \times 250 \times 320 - 250 \times 250 \times 330}{10 \times 320 - 250 \times 250} = \frac{2{,}500(640 - 8{,}250)}{25(128 - 2{,}500)} = \frac{100(-7{,}610)}{-2{,}372} = 320.82630$$

$$a = \log_e \frac{320.82630 - 10}{10} = \log_e 31.082630 = 2.302585 \log_{10} 31.082630 = 2.302585(1.4925178) = 3.4366491$$

$$b = \frac{1}{13}\log_e \frac{10(320.82630 - 250)}{250(320.82630 - 10)} = \frac{2.302585}{13}\log_{10}\frac{70.82630}{7{,}770.6575}$$
$$= 0.1771219\,(\log_{10} 0.00911458) \qquad [\log_{10} 0.00911458 = 7.9597368 - 10]$$
$$= 0.1771219\,(-2.0402632) = -0.3613753$$
As indicated in Table 82, the values for m for various values of 
t are conveniently found by the use of logarithms; thus, since 

$$m = e^{a + bt}$$
$$\log m = \log_{10} e\,(a + bt) \qquad [\text{since } \log_{10} e = 0.43429]$$
$$= 0.43429(a + bt)$$

or, for the example illustrated, 

$$\log m = 0.43429(3.4366491 - 0.3613753t)$$
$$= 1.4925023 - 0.1569417t$$

For the year 1909, when t = 0, the value of log m is 1.4925023, 
as may be seen from the work sheet (Table 82), and the values 
of log m for other values of t are obtained by the successive 
algebraic subtraction of the constant -0.1569417 through the 
years preceding 1909 and by successive algebraic addition of the 
constant -0.1569417 through the years subsequent to 1909. 
These are the logarithms of m for the various values of t, that 
is, for the various years. In the next column of the work sheet, 
the antilogarithms are entered, which, when added to 1, are 
divided into the constant k in order to find the trend values for 
each year. These steps are shown in the next three columns of 
Table 82. An index of normal, that is, y/y', is also calculated. 
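
The steps of the work sheet (logarithm of m, antilogarithm, division into k, and index of normal) may be sketched for a few years as follows. The constants are those computed above; the names `log10_m` and `trend`, and the choice of three sample years from Table 82, are merely illustrative.

```python
K = 320.82630                                   # the constant k found above


def log10_m(t):
    """Common logarithm of m for year t (origin at 1909)."""
    return 1.4925023 - 0.1569417 * t


def trend(t):
    """Trend value y' = k / (1 + m), in thousands of cars."""
    m = 10 ** log10_m(t)
    return K / (1 + m)


# Raw data for three sample years, thousands of cars (Table 82).
production = {1909: 10.9, 1922: 212.0, 1929: 446.5}
for year, y in production.items():
    y_trend = trend(year - 1909)
    index_of_normal = 100 * y / y_trend
    print(year, round(y_trend, 3), round(index_of_normal, 1))
```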
TABLE 82.—WORK SHEET FOR CALCULATING INDEX OF NORMAL AND TREND 
Logistic trend of the Pearl-Reed type 
DATA: Automobile production in the United States. Annual averages of 
monthly data 
(In thousands of cars) 

                                                      Trend,              Index of 
Year    t     log m         m            1 + m        y' = k/(1 + m)*     Data,    normal, 
                                                                          y        y/y' 

1903   -6     2.4341525    271.7393     272.7393        1.176       0.9      76.5 
1904   -5     2.2772108    189.3261     190.3261        1.686       1.9     112.7 
1905   -4     2.1202691    131.9073     132.9073        2.414       2.1      87.0 
1906   -3     1.9633274     91.90234     92.90234       3.453       2.8      81.1 
1907   -2     1.8063857     64.03030     65.0303        4.933       3.7      75.0 
1908   -1     1.6494440     44.61122     45.61122       7.034       5.4      76.8 
1909    0     1.4925023     31.08152     32.08152      10.000      10.9     109.0 
1910    1     1.3355606     21.65515     22.65515      14.161      15.6     110.2 
1911    2     1.1786189     15.08757     16.0876       19.942      17.5      87.8 
1912    3     1.0216772     10.51177     11.5118       27.869      31.5     113.0 
1913    4     0.8647355      7.32380      8.3238       38.543      40.4     104.8 
1914    5     0.7077938      5.10263      6.1026        ....       47.4      90.2 
1915    6     0.5508521      3.55510      4.5551        ....       80.8     114.7 
1916    7     0.3939104      2.47691      3.4769        ....      134.8     146.1 
1917    8     0.2369687      1.72571      2.7257        ....      156.2     132.7 
1918    9     0.0800270      1.20234      2.2023      145.675      97.6      67.0 
1919   10    -0.0769147      0.83769      1.8377      174.581     161.1      92.3 
1920   11    -0.2338564      0.58364      1.5836      202.588     185.6      91.6 
1921   12    -0.3907981      0.40663      1.4066      228.082     134.7      59.0 
1922   13    -0.5477398      0.28331      1.2833      249.999     212.0      84.8 
1923   14    -0.7046815      0.19739      1.1974      267.938     336.2     125.5 
1924   15    -0.8616232      0.13752      1.1375      282.040     300.2     106.4 
1925   16    -1.0185649      0.095815     1.0958      292.773     355.5     121.4 
1926   17    -1.1755066      0.066756     1.0668      300.748     358.4     119.2 
1927   18    -1.3324483      0.046511     1.0465      306.568     283.4      92.4 
1928   19    -1.4893900      0.032405     1.0324      310.758     363.2     116.9 
1929   20    -1.6463317      0.022577     1.0226      313.742     446.5     142.3 
1930   21    -1.8032734      0.015730     1.0157      315.858     279.7      88.6 
1931   22    -1.9602151      0.010959     1.0110      317.348     199.1      62.7 
1932   23    -2.1171568      0.007636     1.0076      318.394     114.2      35.9 
1933   24    -2.2740985      0.0053199    1.0053      319.128     160.0      50.1 
1934   25    -2.4310402      0.0037065    1.0037      319.640     229.4      71.8 
1935   26    -2.5879819      0.0025824    1.00258     320.001     328.9     102.8 
1936   27    -2.7449236      0.0017992    1.00180     320.250     371.2     115.9 
1937   28    -2.9018653      0.0012535    1.00125     320.426     400.7     125.0 
1938   29    -3.0588070      0.0008734    1.00087     320.547     207.4      64.7 
1939   30    -3.2157487      0.0006085    1.00061     320.631     298.1      93.0 
1940   31    -3.3726904      0.0004239    1.00042     320.692     372.4     116.1 
1941   32    -3.5296321      0.0002954    1.00030     320.730     403.2     125.7 

Total                                                                     3,788.7† 

Source: Data from Statistical Abstract of the United States, 1933, p. 334; and Survey of 
Current Business, Annual Supplement, Vol. 12 (1932), and current issues, passim. 

* k = 320.82630. See p. 579. 

† If the curve had been fitted according to the least-squares criterion, this sum would 
approximate a hundred times the number of years, that is, 3,900. 
The results lend support to the hypothesis that automobile 
production in the United States had a growth during those years 
following the law of the Pearl-Reed logistic curve. The goodness 
of fit of the trend is attested to, not only by the plotting of the 
curve with the data in Fig. 145, but also by the fact that the 
sum of the ratios of the raw data to the trend equals approximately 
a hundred times the number of years. 

Fig. 145.—Automobile production in the United States, 1903-1941. Pearl-Reed 
curve fitted by method of selected points. 

ILLUSTRATIONS OF EMPIRICAL TRENDS 


The distinction between rational trends and empirical trends 
lies, not in the method of calculation, but in the interpretation 
and analytical use of the trend after it is calculated. Yet, in the 
case of empirical trends, it frequently suffices to fit a trend line 
of very simple character. Thus a straight line may be quite 
adequate in some cases. 

Straight-line Trend. Table 83 contains a work sheet for 
calculating a straight-line trend in open-market commercial 
paper outstanding for the period 1931-1941. Rationalization 
of this trend is uncertain: it may be the commencement of a new 
period of growth in what was supposed to be a dying institution, 
or it may be merely a cyclical movement. At any rate, for the 
period of 11 years selected the trend analysis makes possible a 
better study of the shorter term cyclical or residual movements in 
the data. 

TABLE 83.—WORK SHEET FOR CALCULATING INDEX OF NORMAL AND TREND 
Straight line 
DATA: Open-market commercial paper outstanding. Annual averages of 
monthly data 
(In millions of dollars) 
Equation of trend: y' = 206 + 12.49t (origin at 1936) 

                                                        Index of 
          Raw                                           computed 
Year      data,                              Trend,     trend, 
          y        t       t²       ty       y'         y/y' 
(1)       (2)     (3)     (4)      (5)      (6)        (7) 

1931      264     -5      25     -1,320     144        183.3 
1932      106     -4      16       -424     156         67.9 
1933       95     -3       9       -285     168         56.5 
1934      156     -2       4       -312     181         86.2 
1935      174     -1       1       -174     194         89.7 
1936      188      0       0          0     206         91.3 
1937      296      1       1        296     218        135.8 
1938      239      2       4        478     231        103.5 
1939      198      3       9        594     243         81.5 
1940      234      4      16        936     256         91.4 
1941      317      5      25      1,585     268        118.3 

N = 11    Σy = 2,267    Σt = 0    Σt² = 110    Σty = 1,374    Σ(y/y') = 1,105.4* 

* This total is a cross check on the work sheet; it should equal a hundred times the number 
of years. Failure to check precisely is due to rounding. 

Source: Annual Report of the Federal Reserve Board, 1929, p. 121; 
1935, p. 174. Survey of Current Business, Annual Supplement, Vol. 20 
(1940), p. 47 and current issues, passim. 

The work sheet contains all the information necessary to 
calculate the equation of trend, which in this case is of the simple 
form y' = a + bt. As seen in Eq. (5), the equation is found 
by the following: 

$$a = \frac{\sum y}{N} \qquad b = \frac{\sum ty}{\sum t^2}$$

From the work sheet, for this particular problem, 

Σy = 2,267   Σt = 0   Σt² = 110   Σty = 1,374   N = 11 

Accordingly, 

$$a = \frac{2{,}267}{11} = 206$$
$$b = \frac{1{,}374}{110} = 12.49$$

and the equation of trend is y' = 206 + 12.49t (origin at 1936). 
It is necessary to specify the origin in order to know for which 
year t = 0. If the origin were 1931, the equation would be 
y' = 144 + 12.49t (origin at 1931); this equation describes the 
same straight line as y' = 206 + 12.49t (origin at 1936). 

Column (6) of the work sheet contains the solutions of the 
trend equation for the respective values of t. Thus, for 1933, 
t = -3, and the solution of the trend equation for that year is 
y' = 206 + (-3)(12.49) = 168. 

Column (7) is the index of computed trend, each y of the raw 
data divided by the corresponding y' of the trend, and the result 
expressed as a percentage. Thus, 264 is 183.3 per cent of 144. 
Polynomial Empirical Trends. Laborsaving Devices. It is 
possible to find a second-degree, third-degree, or higher degree 
polynomial trend by the methods already illustrated. To fit a 
second-degree polynomial, according to the least-squares criterion, 
the work sheet would be like that illustrated in skeleton form 
on page 570. But it is better, for practical use, to introduce two 
important sets of laborsaving devices before proceeding to fit 
the higher order polynomial trends. The first set of laborsaving 
devices has to do with economy of calculation in the work sheet; 
the second set has to do with solving the equations for different 
values of t, therefore, with computing trend values for various 
years. 

Economy of Calculation in the Work Sheet. As already noted, 
an economy was obtained by taking an odd number of years and 
making the median year the origin, so that Σt = 0, Σt³ = 0, and 
similarly the total of all odd powers of t will be equal to zero; 
hence, columns in the work sheet for odd powers of t are not 
required. In addition, the entry of columns in the work sheet 
for the even powers of t may be avoided because Σt², Σt⁴, Σt⁶, 
etc., can be computed from formulas. It can be shown by 
algebraic derivation¹ that, if t runs integrally from t = −(n − 1) to 
t = +(n − 1), in which n = (N + 1)/2, i.e., n = t (terminal value) + 1, 

\[
\begin{aligned}
\Sigma t^2 &= \frac{n(n-1)(2n-1)}{3} \\
\Sigma t^4 &= \Sigma t^2 \left( \frac{3n^2 - 3n - 1}{5} \right) \\
\Sigma t^6 &= \Sigma t^2 \left( \frac{3n^3(n-2) + 3n + 1}{7} \right)
\end{aligned}
\tag{10}
\]

By similar algebraic computation Σt⁸ can be evaluated in terms 
of n, but it is preferable to use orthogonal polynomials if a trend 
equation of fourth or higher degree is sought. 
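The formulas of Eqs. (10) can be checked against direct summation. The following is a minimal sketch in Python; the function names `even_power_sums` and `brute_force` are illustrative and not part of the text, and an odd number of years N is assumed, as in the work sheets of this chapter.

```python
def even_power_sums(N):
    """Return (sum t^2, sum t^4, sum t^6) for t = -(N-1)/2, ..., +(N-1)/2.

    Uses the closed formulas of Eqs. (10) with n = (N + 1)/2.
    This sketch assumes N is odd, so that t takes integral values.
    """
    if N % 2 == 0:
        raise ValueError("this sketch assumes an odd number of years")
    n = (N + 1) // 2
    s2 = n * (n - 1) * (2 * n - 1) // 3
    s4 = s2 * (3 * n**2 - 3 * n - 1) // 5
    s6 = s2 * (3 * n**3 * (n - 2) + 3 * n + 1) // 7
    return s2, s4, s6


def brute_force(N):
    """The same three sums, obtained by adding the powers of t directly."""
    half = (N - 1) // 2
    ts = range(-half, half + 1)
    return tuple(sum(t**k for t in ts) for k in (2, 4, 6))


print(even_power_sums(13))   # (182, 4550, 134342)
print(even_power_sums(9))    # (60, 708, 9780)
```

For N = 13 and N = 9 the first two (respectively all three) figures are the ones used later in the work sheets of Tables 86 and 91.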

A second economy for the work sheet is secured by using a 
subtotal summation procedure by which aggregates S₁, S₂, S₃, 
S₄, etc., are obtained. From these aggregates algebraic formulas 
are used to compute as follows: 

\[
\begin{aligned}
\Sigma y &= S_1 \\
\Sigma ty &= nS_1 - S_2 \\
\Sigma t^2 y &= n^2 S_1 - (2n+1)S_2 + 2S_3 \\
\Sigma t^3 y &= n^3 S_1 - (3n^2 + 3n + 1)S_2 + 6(n+1)S_3 - 6S_4
\end{aligned}
\tag{11}
\]

in which n = (N + 1)/2. 

¹ Cf. Ross, Frank A., "Formulae for Facilitating Computation in Time 
Series Analysis," Journal of the American Statistical Association, Vol. 20 
(1925), pp. 75-79. For method of proof see footnote 1, p. 586. 
TABLE 84.—ECONOMICAL WORK SHEET FOR CALCULATING POLYNOMIAL 
TREND, ALGEBRAIC ILLUSTRATION 
(Method of least squares) 

                 Data   Sets of subtotals
Year   T    t     y     First                  Second                      Third
(1)   (2)  (3)   (4)    (5)                    (6)                         (7)
1937   1   −2    y₁     y₁                     y₁                          y₁
1938   2   −1    y₂     y₁ + y₂                2y₁ + y₂                    3y₁ + y₂
1939   3    0    y₃     y₁ + y₂ + y₃           3y₁ + 2y₂ + y₃              6y₁ + 3y₂ + y₃
1940   4    1    y₄     y₁ + y₂ + y₃ + y₄      4y₁ + 3y₂ + 2y₃ + y₄        10y₁ + 6y₂ + 3y₃ + y₄
1941   5    2    y₅     y₁ + y₂ + y₃ + y₄ + y₅ 5y₁ + 4y₂ + 3y₃ + 2y₄ + y₅  15y₁ + 10y₂ + 6y₃ + 3y₄ + y₅
Total            S₁     S₂                     S₃                          S₄


The subtotal summation process is illustrated algebraically in 
Table 84 and arithmetically in Table 85. The sum of column 
(4), S₁, is merely Σy. Column (5) contains the first set of 
subtotals, which is obtained on the adding machine by taking a 
subtotal after entry of each item in column (4); the first subtotal 
in column (5) will thus be the first item of column (4), that is, 
y₁; the second subtotal in column (5) will be y₁ + y₂; the third 
subtotal will be y₁ + y₂ + y₃; etc. S₂ is the sum of these 
subtotals. 

The second set of subtotals, column (6), consists of subtotals 
of the figures in the preceding column, column (5); thus the first 
subtotal in column (6) is y₁, the second subtotal is 2y₁ + y₂, the 
third subtotal is 3y₁ + 2y₂ + y₃, etc. S₃ is the sum of the 
second set of subtotals. 

The third set of subtotals, column (7), consists of the subtotals 
of figures in column (6); and S₄ is the sum of this third set of 
subtotals. 

This process of taking subtotals and aggregating the subtotals 
by columns to obtain S₂, S₃, S₄ can be repeated as many times as 
desired, depending on how high a degree of polynomial is to be 
fitted. If carried as far as S₄, a third-degree polynomial can be 
fitted. 

A cross check on the work sheet is noted in Table 85: S₁ is 
equal to the final subtotal in column (5), S₂ is equal to the final 
subtotal in column (6), S₃ is equal to the final subtotal in column 
(7), etc. 
TABLE 85.—ECONOMICAL WORK SHEET FOR CALCULATING POLYNOMIAL 
TREND, ARITHMETICAL ILLUSTRATION 
(Method of least squares) 

                 Data    Sets of subtotals
Year   T    t     y     First   Second   Third
(1)   (2)  (3)   (4)     (5)     (6)      (7)
1937   1   −2     2       2       2        2
1938   2   −1     5       7       9       11
1939   3    0     8      15      24       35
1940   4    1     7      22      46       81
1941   5    2     9      31      77      158
Total            31      77     158      287
                 S₁      S₂      S₃       S₄


From Table 84 it can be readily seen that algebraically 

\[
\begin{aligned}
S_1 &= y_1 + y_2 + y_3 + \cdots + y_N \\
S_2 &= N y_1 + (N-1) y_2 + (N-2) y_3 + \cdots + y_N \\
S_3 &= \frac{N(N+1)}{2}\, y_1 + \frac{(N-1)N}{2}\, y_2 + \cdots + y_N \\
S_4 &= \frac{N(N+1)(N+2)}{6}\, y_1 + \frac{(N-1)N(N+1)}{6}\, y_2 + \cdots + y_N
\end{aligned}
\tag{12}
\]

In S₃, the coefficient of y₁ is equal to the sum of the natural 
numbers from 1 to N, that is, ΣT taken from 1 to N, which equals 
N(N + 1)/2; the coefficient of y₂ is the sum of the natural numbers 
from 1 to N − 1, which equals (N − 1)N/2; etc.¹ In S₄, the 
coefficient of y₁ is the sum of N(N + 1)/2 as N goes from 1 to N; 
this sum equals N(N + 1)(N + 2)/6; etc. 

¹ This may be demonstrated as follows: 
ΣT = 1 + 2 + 3 + 4 + ··· + N 
Also, 
ΣT = N + (N − 1) + (N − 2) + (N − 3) + ··· + 1 
By adding, 
2ΣT = (N + 1) + (N + 1) + (N + 1) + (N + 1) + ··· + (N + 1) = N(N + 1) 
and hence ΣT = N(N + 1)/2. 


It will be convenient to express these sums in the following 
manner: 

\[
\begin{aligned}
S_1 &= \Sigma y \\
S_2 &= \Sigma (N+1-T)\, y \\
S_3 &= \Sigma \frac{(N+1-T)(N+2-T)}{2}\, y \\
S_4 &= \Sigma \frac{(N+1-T)(N+2-T)(N+3-T)}{6}\, y
\end{aligned}
\tag{13}
\]

In S₂, in the case of y₁, N + 1 − T = N; in the case of y₂, 
N + 1 − T = N − 1; in the case of y₃, N + 1 − T = N − 2; etc. 
In S₃, in the case of y₁, (N + 1 − T)(N + 2 − T)/2 = N(N + 1)/2; 
in the case of y₂, (N + 1 − T)(N + 2 − T)/2 = (N − 1)N/2; etc. 
In S₄, in the case of y₁, 
(N + 1 − T)(N + 2 − T)(N + 3 − T)/6 = N(N + 1)(N + 2)/6; 
in the case of y₂, 
(N + 1 − T)(N + 2 − T)(N + 3 − T)/6 = (N − 1)N(N + 1)/6; etc. 


In the above equations, if T is replaced by t + T̄ = t + (N + 1)/2, 
and if, by definition, n = (N + 1)/2, these equations become 

\[
\begin{aligned}
S_1 &= \Sigma y \\
S_2 &= n\Sigma y - \Sigma ty \\
2S_3 &= \Sigma (n-t)^2 y + n\Sigma y - \Sigma ty \\
6S_4 &= \Sigma (n-t)^3 y + 3\Sigma (n-t)^2 y + 2n\Sigma y - 2\Sigma ty
\end{aligned}
\tag{14}
\]

in which the unmarked Σ refers to summations with respect to 
t from −(N − 1)/2 to +(N − 1)/2. If these equations are expanded 
and similar terms assembled, Eqs. (11) are obtained. 
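The subtotal procedure and Eqs. (11) can be tried out on the figures of Table 85. The sketch below is only an illustration: the names `subtotal_aggregates` and `moment_sums` are assumptions introduced here, and `itertools.accumulate` stands in for the adding machine's running subtotal.

```python
from itertools import accumulate


def subtotal_aggregates(y, depth=4):
    """Return [S1, S2, ..., S_depth].

    S1 is the sum of y; each later aggregate is the sum of the running
    subtotals of the preceding column, as in Tables 84 and 85.
    """
    column = list(y)
    aggregates = []
    for _ in range(depth):
        aggregates.append(sum(column))
        column = list(accumulate(column))   # next set of subtotals
    return aggregates


def moment_sums(y):
    """Return (sum y, sum ty, sum t^2 y, sum t^3 y) by Eqs. (11).

    Assumes an odd number of equally spaced years with origin at the
    middle year, so that n = (N + 1)/2 is an integer.
    """
    n = (len(y) + 1) // 2
    S1, S2, S3, S4 = subtotal_aggregates(y, 4)
    return (
        S1,
        n * S1 - S2,
        n**2 * S1 - (2 * n + 1) * S2 + 2 * S3,
        n**3 * S1 - (3 * n**2 + 3 * n + 1) * S2 + 6 * (n + 1) * S3 - 6 * S4,
    )


y = [2, 5, 8, 7, 9]                  # column (4) of Table 85
print(subtotal_aggregates(y))        # [31, 77, 158, 287]
print(moment_sums(y))                # (31, 16, 56, 58)
```

The last line agrees with direct multiplication by t = −2, −1, 0, 1, 2; for example, Σty = −4 − 5 + 0 + 7 + 18 = 16.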
TABLE 86.—WORK SHEET FOR CALCULATING TREND—SECOND-DEGREE 
POLYNOMIAL 
(Method of least squares) 
DATA: Consumer expenditures for personal appearance and comfort. 
Annual data in millions of dollars 
SOURCE OF DATA: Survey of Current Business, October, 1942, p. 24. 

                        Sets of subtotals
Year     t       y       First      Second
1929    −6      655        655         655
1930    −5      630      1,285       1,940
1931    −4      540      1,825       3,765
1932    −3      427      2,252       6,017
1933    −2      347      2,599       8,616
1934    −1      393      2,992      11,608
1935     0      441      3,433      15,041
1936     1      503      3,936      18,977
1937     2      545      4,481      23,458
1938     3      543      5,024      28,482
1939     4      540      5,564      34,046
1940     5      568      6,132      40,178
1941     6      653      6,785      46,963
Total         6,785     46,963     239,746
                 S₁         S₂          S₃


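By using Eqs. (11), page 584, the following values are obtained: 

\[
\begin{aligned}
\Sigma y &= 6{,}785 \\
\Sigma ty &= 7(6{,}785) - 46{,}963 = 532 \\
\Sigma t^2 y &= 49(6{,}785) - 15(46{,}963) + 2(239{,}746) = 107{,}512
\end{aligned}
\]

By using Eqs. (10) the following values are obtained: 

\[
\begin{aligned}
\Sigma t^2 &= \frac{7(6)(13)}{3} = 182 \\
\Sigma t^4 &= 182 \left( \frac{147 - 21 - 1}{5} \right) = 182(25) = 4{,}550
\end{aligned}
\]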


To find the second-degree polynomial trend equation these values 
may be substituted in Eqs. (7), (i) to (iii), page 570, as follows: 

(i)                6,785 = 13a + 182c 
(ii)                 532 = 182b                  b = 2.923 
(iii)            107,512 = 182a + 4,550c 
(i')              94,990 = 182a + 2,548c         (i) × 14 
(iii) − (i')      12,522 = 2,002c                c = 6.2547 

(iii)            107,512 = 182a + 4,550c 
(i'')            169,625 = 325a + 4,550c         (i) × 25 
(i'') − (iii)     62,113 = 143a                  a = 434 

Accordingly, the second-degree polynomial equation of trend 
for the problem illustrated is y' = 434 + 2.923t + 6.2547t² 
(origin at 1935). 
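Because the sums of the odd powers of t vanish, the three normal equations separate into one equation for b and a pair of equations for a and c. A minimal sketch of that solution follows; the function name `fit_second_degree` and its argument names are illustrative, and the figures passed to it are those obtained above from Table 86.

```python
def fit_second_degree(sum_y, sum_ty, sum_t2y, N, sum_t2, sum_t4):
    """Solve the normal equations for y' = a + bt + ct^2.

    Assumes the origin is at the middle year, so that the sums of odd
    powers of t are zero and the equations reduce to
        sum_y   = N*a      + sum_t2*c
        sum_ty  = sum_t2*b
        sum_t2y = sum_t2*a + sum_t4*c
    """
    b = sum_ty / sum_t2
    det = N * sum_t4 - sum_t2**2
    a = (sum_y * sum_t4 - sum_t2 * sum_t2y) / det
    c = (N * sum_t2y - sum_t2 * sum_y) / det
    return a, b, c


a, b, c = fit_second_degree(6785, 532, 107512, N=13, sum_t2=182, sum_t4=4550)
print(round(a, 1), round(b, 3), round(c, 4))   # 434.4 2.923 6.2547
```

The value of a printed here carries one more figure than the 434 used in the text.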

Finding Trend Values by Method of Finite Differences. The 
equation for a trend line having been found, the problem is to 
compute from this equation the values of y' pertaining to a given 
set of years. Direct substitution is laborious. Finite differences 
provide an easier method. The keystone of the latter 
method is the fact that the nth difference of a polynomial of the 
nth degree is constant. Hence, that constant nth difference 
having been determined, the other differences, and ultimately 
the desired trend values themselves, can all be computed by 
merely reversing the differencing process, i.e., by simple addition. 

In the equation y' = a + bt + ct² the first difference, by 
definition, would be 

\[
\Delta^1 y' = a + b(t+1) + c(t+1)^2 - a - bt - ct^2 = b + 2ct + c
\]

and, by definition, the second difference would be 

\[
\Delta^2 y' = b + 2c(t+1) + c - b - 2ct - c = 2c
\]


TABLE 87.—BUILDING UP A POLYNOMIAL BY FINITE DIFFERENCES 

 (1)      (2)          (3)          (4)          (5)          (6)
         Fourth       Third        Second       First      Polynomial
  t    difference   difference   difference   difference     value
          Δ⁴y'         Δ³y'         Δ²y'         Δ¹y'          y'
 −4        −1            6          −48          220          305
 −3        −1            5          −42          172          525
 −2        −1            4          −37          130          697
 −1        −1            3          −33           93          827
  0        −1            2          −30           60          920
  1                      1          −28           30          980
  2                                 −27            2        1,010
  3                                              −25        1,012
  4                                                           987
Table 87 illustrates the building up of a polynomial by finite 
differences. The polynomial here is of the fourth degree, and 
hence its fourth differences are all identical. 

A figure in a given line of column (6) algebraically added to 
the figure in the same line of column (5) gives the next figure for 
column (6); thus, 305 + 220 = 525, 525 + 172 = 697, etc. 
Similarly, a figure in a given line of column (5) added algebraically 
to the figure in the same line of column (4) gives the next figure 
for column (5); thus, 220 − 48 = 172, 172 − 42 = 130, etc. 
The same general rule applies to the figures in columns (2) 
and (3); thus, −48 + 6 = −42, −42 + 5 = −37, etc., and 
6 − 1 = 5, 5 − 1 = 4, etc. 

The polynomial illustrated in Table 87 is known to have a 
constant fourth difference. Hence, if the 
polynomial value and the differences of any one line are all 
known, then the differences and polynomial values for all other 
lines above or below the given line can be readily computed. 
Thus, if for t = 0 it is known that the polynomial value y' = 920, 
Δ¹y'₀ = 60, Δ²y'₀ = −30, Δ³y'₀ = 2, and the constant fourth 
difference is equal to −1, then, by working from left to right and 
up and down, the other values in the table can be built up. 
The first set of variable differences, in this case the third, can be 
built up cumulatively from the known Δ³y'₀ = 2 and the constant 
difference −1. It is to be noted that in a downward 
direction, in this table, this constant difference is −1; so in 
building up from the bottom to the top the constant difference is 
algebraically −(−1), or +1. This rule follows also for the 
building up of the other differences. 
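The building-up process can be sketched in a few lines of Python. The function name `build_up` and its arguments are illustrative assumptions; the starting figures are those given for t = 0 in Table 87.

```python
def build_up(at_zero, t_min, t_max):
    """Build polynomial values from the value and differences at t = 0.

    at_zero = [y0, d1, d2, ..., dk], where dk is the constant k-th
    difference and t_min <= 0 <= t_max. Returns {t: polynomial value}.
    """
    values = {0: at_zero[0]}

    # Downward in the table (t = 1, 2, ...): add each difference to the
    # figure on its right, lowest order first so the old figures are used.
    state = list(at_zero)
    for t in range(1, t_max + 1):
        for i in range(len(state) - 1):
            state[i] += state[i + 1]
        values[t] = state[0]

    # Upward in the table (t = -1, -2, ...): subtract instead, highest
    # order first, so each step uses the difference of the earlier line.
    state = list(at_zero)
    for t in range(-1, t_min - 1, -1):
        for i in range(len(state) - 2, -1, -1):
            state[i] -= state[i + 1]
        values[t] = state[0]

    return values


table_87 = build_up([920, 60, -30, 2, -1], t_min=-4, t_max=4)
print([table_87[t] for t in range(-4, 5)])
# [305, 525, 697, 827, 920, 980, 1010, 1012, 987]
```

Only additions and subtractions are performed, which is the point of the method.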


TABLE 88.—AID FOR COMPUTING FINITE DIFFERENCES AT t = 0 IN 
POLYNOMIAL y' = a + bt + ct² + dt³ + ··· 

               (1)       (2)       (3)       (4)       (5)
Parameter     Δ¹y'₀     Δ²y'₀     Δ³y'₀     Δ⁴y'₀     Δ⁵y'₀
    b           1
    c           1         2
    d           1         6         6
    e           1        14        36        24
    f           1        30       150       240       120
This method is of general validity and can be used to find 
values of a polynomial of any degree from knowledge of its 
value for one year and its differences for that same year. 
Fortunately, it is relatively easy to calculate the y' or polynomial 
value and the various differences for the year t = 0. If the 
form of the polynomial is y' = a + bt + ct² + dt³ + et⁴ + ···, 
then the polynomial value for t = 0 is y'₀ = a. The first, second, 
and higher order differences for t = 0 can be computed with the 
help of Table 88. 

The figures in Table 88 give the weights by which the 
parameters b, c, d, e, f, etc., must be multiplied to give the difference 
specified at the top of each column, as follows: 

\[
\begin{aligned}
\Delta^1 y'_0 &= b + c + d + e + f + \cdots \\
\Delta^2 y'_0 &= 2c + 6d + 14e + 30f + \cdots \\
\Delta^3 y'_0 &= 6d + 36e + 150f + \cdots \\
\Delta^4 y'_0 &= 24e + 240f + \cdots \\
\Delta^5 y'_0 &= 120f + \cdots
\end{aligned}
\]

For a particular polynomial, each of these equations, of course, 
terminates with the coefficient of the highest power of t. Thus, 
for a second-degree polynomial y' = a + bt + ct², the formulas for 
the differences at t = 0 would be Δ¹y'₀ = b + c and Δ²y'₀ = 2c, 
the higher differences being zero since the second difference is the 
same for all values of t. For a third-degree polynomial, 

\[
y' = a + bt + ct^2 + dt^3
\]

the formulas would be Δ¹y'₀ = b + c + d, Δ²y'₀ = 2c + 6d, and 
Δ³y'₀ = 6d. For a fourth-degree polynomial 

\[
y' = a + bt + ct^2 + dt^3 + et^4
\]

the differences would be Δ¹y'₀ = b + c + d + e, 
Δ²y'₀ = 2c + 6d + 14e, Δ³y'₀ = 6d + 36e, and Δ⁴y'₀ = 24e. 

For higher degree polynomials, the table can be readily 
extended by the rule that a figure in a given line of a given column 
is equal to the number of the column multiplied by the sum of 
the two figures in the line above situated in the given column 
and in the column to the left, respectively. For example, 
36 = 3(6 + 6); 24 = 4(0 + 6); etc.¹ 
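The extension rule for Table 88 lends itself to a short routine. The sketch below is illustrative only: `difference_weights` regenerates the lines of the table from the rule, and `differences_at_zero` applies the weights to a list of coefficients; both names are assumptions introduced here.

```python
def difference_weights(max_power):
    """Return weights[p][k]: the weight of the coefficient of t**p in the
    k-th difference at t = 0 (line p, column k of Table 88).

    Each figure is the column number times the sum of the two figures in
    the line above, in the same column and in the column to the left.
    """
    weights = {1: {1: 1}}
    for p in range(2, max_power + 1):
        above = weights[p - 1]
        weights[p] = {
            k: k * (above.get(k, 0) + above.get(k - 1, 0))
            for k in range(1, p + 1)
        }
    return weights


def differences_at_zero(coeffs):
    """coeffs = [a, b, c, ...]; return [y'_0, first difference, second, ...]."""
    degree = len(coeffs) - 1
    w = difference_weights(degree)
    result = [coeffs[0]]
    for k in range(1, degree + 1):
        result.append(sum(w[p][k] * coeffs[p] for p in range(k, degree + 1)))
    return result


weights = difference_weights(5)
for p, letter in zip(range(1, 6), "bcdef"):
    print(letter, [weights[p][k] for k in range(1, p + 1)])
# b [1]
# c [1, 2]
# d [1, 6, 6]
# e [1, 14, 36, 24]
# f [1, 30, 150, 240, 120]
```

For example, `differences_at_zero([434, 2.923, 6.2547])` returns the starting value 434 followed by the first difference b + c and the second difference 2c.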

The use of finite differences to compute the trend values of a 
second-degree polynomial is illustrated in Table 89. The trend 
is y' = 434 + 2.923t + 6.2547t² (origin at 1935), calculated 
above in Table 86 for data on consumer expenditures for personal 
appearance and comfort in the United States, 1929-1941. 

In Table 89, the constant second difference is known to be 
12.509; the first difference for t = 0 is 9.2,* and the trend value 
for t = 0 is y' = 434.0. These are first entered in the work 
sheet; then, since the constant second difference is positive, the 
remainder of the column of first differences is obtained by 
successively subtracting 12.509 to obtain first differences for earlier 
years and by successively adding 12.509 to obtain first differences 
for later years. Obtaining the trend values is illustrated as 
follows: 434.0 + 9.2 = 443.2, 443.2 + 21.7 = 464.9, etc.; for 
values before 1935, 434.0 − (−3.3) = 437.3, 437.3 − (−15.8) 
= 453.1, etc. Beginning at the top of the table, it will then 
be found that 641.5 − 65.9 = 575.6, 575.6 − 53.4 = 522.2, etc. 

TABLE 89.—WORK SHEET FOR COMPUTING TREND VALUES BY METHOD OF 
FINITE DIFFERENCES 
Equation of trend: y' = 434 + 2.923t + 6.2547t² 
Value of y'₀ = 434 
Value of Δ¹y'₀ = 2.923 + 6.2547 = 9.1777 
Constant Δ²y' = 12.509 

Year     t      Δ²y'      Δ¹y'       y'
1929    −6      ...      −65.9     641.5
1930    −5      ...      −53.4     575.6
1931    −4      ...      −40.8     522.2
1932    −3      ...      −28.3     481.4
1933    −2      ...      −15.8     453.1
1934    −1      ...       −3.3     437.3
1935     0    12.509       9.2     434.0
1936     1      ...       21.7     443.2
1937     2      ...       34.2     464.9
1938     3      ...       46.7     499.1
1939     4      ...       59.2     545.8
1940     5      ...       71.7     605.0
1941     6      ...        ...     676.7

¹ Cf. WHITTAKER and ROBINSON, op. cit., pp. 1-7. 

* The first differences may be rounded without causing cumulative error. 

While the explanation of the method of finite differences may 
be lengthy, its use in solving second-degree polynomials for 
various values of t is much more expeditious than the method of 
obtaining solutions to the equation for the various values of t by 
substitution in the equation. The labor involved in the longer 
method is great if the number of years is large or if the 
polynomial is of higher than the second degree. In contrast, the 
method of finite differences may be used without difficulty, and 
the arithmetic involved is always simple addition or subtraction. 

A danger inheres in the use of finite differences, namely, that 
any error in the higher order differences is cumulated as the lower 
order differences are computed. For this reason, when the 
trend line is determined the coefficients of the higher powers of 
t should be carried out to a larger number of places than would 
be regarded as significant. If, for example, the coefficient of t⁴ 
is rounded off to the fifth place, the maximum error in the fourth 
difference is 24 × 0.000005 = 0.00012, over a 7-year period. 
If the other coefficients have also been rounded off to the last 
place indicated, then the maximum error 

in Δ³y'₀ = 6(0.00005) + 36(0.000005) = 0.00048 
in Δ²y'₀ = 2(0.0005) + 6(0.00005) + 14(0.000005) = 0.00137 
in Δ¹y'₀ = 0.005 + 0.0005 + 0.00005 + 0.000005 = 0.005555 

and in y'₀ = 0.05. Thus, by the time y'₆, the seventh year, for 
example, has been computed, the maximum error in that figure 
becomes 0.11+. This is shown in Table 90. 

TABLE 90.—MAXIMUM CUMULATED ERRORS IN DIFFERENCES AND 
POLYNOMIAL VALUES 
Error in Δ⁴y' = 0.000120 

  Δ³y'        Δ²y'        Δ¹y'         y'
0.000480    0.001370    0.005555    0.050000
0.000600    0.001850    0.006925    0.055555
0.000720    0.002450    0.008775    0.062480
0.000840    0.003170    0.011225    0.071255
            0.004100    0.014395    0.082480
                        0.018495    0.096875
                                    0.115370

The final error grows larger the further the work proceeds and 
thus makes it necessary to compute the coefficients of higher 
powers of t to several figures beyond the number of significant 
figures required in the computed trend values. A cross check 
on the method of finite differences would be to solve the 
polynomial equation for the terminal values of t. 

The danger of cumulative error is reduced to a minimum by 
starting at t = 0 and accumulating upward through the −t's 
and accumulating downward through the +t's. 

Analysis of Cycles by Empirical Trends. Data on plate- 
glass production in the United States, 1933-1941, have been 
selected, in order to illustrate how cycles may be studied by 
empirical trend analysis. Table 91 is a work sheet providing the 
figures needed to compute either a straight-line trend or any 
polynomial trend up to the third degree. 

TABLE 91.—WORK SHEET FOR COMPUTING TREND AND INDEX OF NORMAL 
(Method of least squares) 
DATA: Production of plate glass, polished, in the United States 
(In millions of square feet, monthly) 
SOURCE: Survey of Current Business, Supplement, Vol. 20 (1940), p. 151; 
Vol. 21 (February, 1941), p. 99, Annual (March, 1942), p. 8-35. 

                        Sets of subtotals
Year     t      y       First     Second      Third
1933    −4     7.2        7.2        7.2        7.2
1934    −3     7.9       15.1       22.3       29.5
1935    −2    15.0       30.1       52.4       81.9
1936    −1    16.5       46.6       99.0      180.9
1937     0    16.0       62.6      161.6      342.5
1938     1     7.1       69.7      231.3      573.8
1939     2    11.8       81.5      312.8      886.6
1940     3    13.7       95.2      408.0    1,294.6
1941     4    15.9      111.1      519.1    1,813.7
Total        111.1      519.1    1,813.7    5,210.7
                S₁         S₂         S₃         S₄

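By using Eqs. (10), the values of Σt², Σt⁴, and Σt⁶ are found 
as follows: 

\[
\begin{aligned}
\Sigma t^2 &= \frac{n(n-1)(2n-1)}{3} = \frac{5(4)(9)}{3} = 60 \\
\Sigma t^4 &= \Sigma t^2 \left( \frac{3n^2 - 3n - 1}{5} \right) = 60 \left( \frac{75 - 15 - 1}{5} \right) = 708 \\
\Sigma t^6 &= \Sigma t^2 \left[ \frac{3n^3(n-2) + 3n + 1}{7} \right] = 60 \left[ \frac{3(125)(3) + 15 + 1}{7} \right] = 60(163) = 9{,}780
\end{aligned}
\]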


By using Eqs. (11), the values of Σty, Σt²y, and Σt³y are found 
as follows: 

\[
\begin{aligned}
\Sigma y &= S_1 = 111.1 \\
\Sigma ty &= nS_1 - S_2 = 5(111.1) - 519.1 = 36.4 \\
\Sigma t^2 y &= n^2 S_1 - (2n+1)S_2 + 2S_3 \\
  &= 25(111.1) - 11(519.1) + 2(1{,}813.7) = 694.8 \\
\Sigma t^3 y &= n^3 S_1 - (3n^2 + 3n + 1)S_2 + 6(n+1)S_3 - 6S_4 \\
  &= 125(111.1) - 91(519.1) + 36(1{,}813.7) - 6(5{,}210.7) = 678.4
\end{aligned}
\]


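From these, two trend lines may be computed, first a straight 
line, and second a third-degree polynomial, as follows: 

Straight-line trend: 

\[
a = \frac{\Sigma y}{N} = \frac{111.1}{9} = 12.34 \qquad
b = \frac{\Sigma ty}{\Sigma t^2} = \frac{36.4}{60} = 0.607
\]

\[
y' = 12.34 + 0.607t \quad \text{(origin at 1937)}
\]

Third-degree polynomial trend: 
The normal equations are 

\[
\begin{aligned}
\Sigma y &= Na + b\Sigma t + c\Sigma t^2 + d\Sigma t^3 \\
\Sigma ty &= a\Sigma t + b\Sigma t^2 + c\Sigma t^3 + d\Sigma t^4 \\
\Sigma t^2 y &= a\Sigma t^2 + b\Sigma t^3 + c\Sigma t^4 + d\Sigma t^5 \\
\Sigma t^3 y &= a\Sigma t^3 + b\Sigma t^4 + c\Sigma t^5 + d\Sigma t^6
\end{aligned}
\]

in which all the sums of the odd powers of t are equal to zero, 
so that the equations for finding a, b, c, d are as follows: 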
(i)                111.1 = 9a + 60c 
(ii)                36.4 = 60b + 708d 
(iii)              694.8 = 60a + 708c 
(iv)               678.4 = 708b + 9,780d 

(ii')             429.52 = 708b + 8,354.4d       (ii) × 11.8 
(iv) − (ii')      248.88 = 1,425.6d              d = 0.17457 

(iii)              694.8 = 60a + 708c 
(i')            1,310.98 = 106.2a + 708c         (i) × 11.8 
(i') − (iii)      616.18 = 46.2a                 a = 13.34 

Substituting d in Eq. (ii), b = −1.45325 
Substituting a in Eq. (i), c = −0.14891 

The third-degree polynomial trend equation is thus 
y' = 13.34 − 1.45325t − 0.14891t² + 0.17457t³ (origin at 1937) 

By using the method of finite differences to solve for various 
trend values, from Table 88 above, at t = 0, 

\[
y'_0 = 13.34
\]

and 

\[
\begin{aligned}
\Delta^1 y'_0 &= -1.45325 - 0.14891 + 0.17457 = -1.42759 \\
\Delta^2 y'_0 &= 2(-0.14891) + 6(0.17457) = 0.74960 \\
\Delta^3 y'_0 &= 6(0.17457) = 1.04742
\end{aligned}
\]

the last of which is a constant difference in this case. 

In Table 92, trend values are built up for the problem by using 
the method of finite differences. First, opposite t = 0, the 
value of y' and the first, second, and third differences are entered. 
The constant third difference, 1.04742, is then subtracted 
successively in the −t direction (upward in the table); it is then 
added successively in the +t direction (downward in the table). 
For example, starting at t = 0, the second difference is 0.74960; 
the second difference at t = −1 is 

0.74960 − 1.04742 = −0.29782 

the second difference at t = −2 is 

−0.29782 − 1.04742 = −1.34524; etc. 

Starting again at t = 0, the second difference is again 0.74960; 
the second difference at t = +1 is 

0.74960 + 1.04742 = 1.79702; etc. 


The column of first differences is built up from the column of 
second differences. For example, starting at t = 0, the first 
difference is −1.42579; the first difference for t = −1 is then 
−1.42579 − (−0.29782) = −1.12797, the first difference for 
t = −2 is −1.12797 − (−1.34524) = +0.21727; etc. Again, 
starting at t = 0 with the first difference −1.42579, the first 
difference at t = +1 is −1.42579 + 0.74960 = −0.67619; the 
first difference at t = +2 is −0.67619 + 1.79702 = 1.12083; etc. 
The values of y' are found from the first differences in exactly the 
same manner as the first differences from the second differences. 

FIG. 146.—Production of plate glass, polished, in the United States, 1933-1941. 
Straight-line and third-degree polynomial trends shown with raw data. 
(Vertical axis: millions of square feet; horizontal axis: years 1933 to 1941.) 

TABLE 92.—WORK SHEET FOR FINDING TREND VALUES BY METHOD OF 
FINITE DIFFERENCES 
Equation of trend: y' = 13.34 − 1.45325t − 0.14891t² + 0.17457t³ 
(origin at 1937) 

Year     t       Δ³y'        Δ²y'        Δ¹y'        y'
1933    −4       ...      −3.44008     6.05001      5.59
1934    −3       ...      −2.39266     2.60993     11.64
1935    −2       ...      −1.34524     0.21727     14.25
1936    −1       ...      −0.29782    −1.12797     14.47
1937     0     1.04742     0.74960    −1.42579     13.34
1938     1       ...       1.79702    −0.67619     11.91
1939     2       ...       2.84444     1.12083     11.24
1940     3       ...         ...       3.96527     12.36
1941     4       ...         ...         ...       16.32

The results of the trend analysis are shown graphically in 
Fig. 146. If it can be assumed that the period of 9 years covered 
by the data is a segment in a longer cyclical movement, 
the straight-line trend may be considered to measure a part of 
that longer cycle, namely, part or all of its upward movement. The 
shorter cycle is then shown by the polynomial trend. Plate-glass 
production appears to have gone through one complete 
short cycle from about 1934 to about mid-1940. 


CHAPTER XXII 
ORTHOGONAL-POLYNOMIAL TRENDS 


Great economy in trend analysis is secured by the use of 
orthogonal polynomials, especially if the trend desired is of 
higher degree than second-degree polynomial. It requires con- 
siderable space to explain and describe the method of orthogonal 
polynomials, which may seem to belie the fact of its economy in 
use, but the actual arithmetic of application is simple. When 
lines of regression involving more than three coefficients are 
fitted to time series by the least-squares criterion, the work of 
computation by the ordinary method increases very rapidly. 
Laborsaving devices introduced in the preceding chapter, including 
the use of the summation work sheet and the determination of 
Σt², Σt⁴, Σt⁶, etc., by formula, help to keep the amount of 
calculation at a minimum; but further reduction in the amount of 
calculation and particularly in the magnitude of the figures that 
have to be handled is obtained by using orthogonal polynomials. 

A "polynomial" is an algebraic expression of the form 

\[
a + bt + ct^2
\]

which, for example, is a polynomial in t of the second degree. A 
polynomial in t of the fourth degree would be 

\[
a + bt + ct^2 + dt^3 + et^4
\]

and so forth. "Orthogonal" polynomials are polynomials that 
bear a certain relationship to each other, to be described below. 
The use of orthogonal polynomials involves merely a special 
method of computing the coefficients of a trend line; the method 
of fitting is still the method of least squares. 

One of the greatest advantages of using orthogonal-polynomial 
trends is that, if the investigator decides to fit either a higher or 
lower degree trend line than what he has already derived, the 
amount of work involved in these further calculations is reduced 
to a minimum. In fact, no extra work at all would be required 
to determine the equation for a trend line of lower degree, while 
the determination of an equation for a trend line of higher degree 
would require only the calculation of quantities pertaining 
directly to the added term and would not necessitate any 
recalculations of other quantities. The work already done will 
therefore not be wasted. 

Orthogonal Polynomials. Suppose a variable t has a set of 
values, say from 0 to 3. If each of these values is substituted in 
a polynomial in t, the polynomial will take on a corresponding 
set of values. Thus, if p₁ = t − 1.5 is a given polynomial in t, 
then, as t has the values 0, 1, 2, and 3, p₁ has values −1.5, 
−0.5, +0.5, and +1.5. Another polynomial in t, say 

\[
p_2 = t^2 - 3t + 1
\]

will have a different set of values; in this instance, it will have 
the values 1, −1, −1, and 1 when t has the values 0, 1, 2, and 3, 
respectively. 

Orthogonal polynomials are those that bear special relationships 
to each other. The necessary condition for two polynomials 
to be orthogonal to each other is that the sum of their 
products for all values of t shall be equal to zero. That this 
necessary condition is met by p₁ = t − 1.5 and p₂ = t² − 3t + 1 
is readily seen. Thus, when t = 0, 

\[
p_1 p_2 = (t - 1.5)(t^2 - 3t + 1) = -1.5
\]

when t = 1, p₁p₂ = +0.5; when t = 2, p₁p₂ = −0.5; and 
when t = 3, p₁p₂ = +1.5. Hence, 

\[
\Sigma p_1 p_2 = -1.5 + 0.5 - 0.5 + 1.5 = 0
\]

The polynomials p₁ = t − 1.5 and p₂ = t² − 3t + 1, accordingly, 
possess the orthogonal property. 
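The arithmetic of this illustration is easily repeated by machine. The following lines are a minimal sketch; the variable names are illustrative.

```python
ts = [0, 1, 2, 3]
p1 = [t - 1.5 for t in ts]              # -1.5, -0.5, 0.5, 1.5
p2 = [t**2 - 3 * t + 1 for t in ts]     # 1, -1, -1, 1

products = [u * v for u, v in zip(p1, p2)]
print(products)        # [-1.5, 0.5, -0.5, 1.5]
print(sum(products))   # 0.0

# Each polynomial also sums to zero, i.e., is orthogonal to unity.
print(sum(p1), sum(p2))   # 0.0 0
```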

In general, if a set of polynomials in t, say p₁, p₂, p₃, ···, p_q, 
form an orthogonal set, then it is necessary that 

\[
\begin{array}{llll}
\Sigma p_1 p_2 = 0 & \Sigma p_1 p_3 = 0 & \cdots & \Sigma p_1 p_q = 0 \\
\Sigma p_2 p_3 = 0 & \Sigma p_2 p_4 = 0 & \cdots & \Sigma p_2 p_q = 0 \\
\Sigma p_3 p_4 = 0 & \Sigma p_3 p_5 = 0 & \cdots & \Sigma p_3 p_q = 0
\end{array}
\tag{1}
\]

These are the general conditions that must be satisfied by 
orthogonal polynomials. Notice that they are equivalent to the 
conditions that the correlation between each pair of polynomials 
is zero. 

Trend Line in Orthogonal Polynomials by the Method of Least 
Squares. The form of a trend-line equation that has so far been 
used is y' = a + bt + ct² + dt³ + ···. This is an arbitrary 
form, however, and it is to be noted that other forms of the 
identical equation are possible. This can be illustrated 
numerically as follows: 

The equation y' = 105.3 + 8.1t − 0.7t² is identically the same 
as y' = 115 + 6(t − 1.5) − 0.7(t² − 3t + 1), which may be 
proved by multiplying out the expressions in the latter equation 
and collecting like terms. If the use of the second form has any 
advantage over the use of the first, there is no reason why it 
may not be adopted. 
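The multiplication referred to may be written out in full; each coefficient of the usual form is obtained by collecting the like terms of the second form:

\[
\begin{aligned}
115 + 6(t - 1.5) - 0.7(t^2 - 3t + 1)
  &= 115 + 6t - 9 - 0.7t^2 + 2.1t - 0.7 \\
  &= (115 - 9 - 0.7) + (6 + 2.1)t - 0.7t^2 \\
  &= 105.3 + 8.1t - 0.7t^2
\end{aligned}
\]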

Suppose, now, that instead of fitting a trend line in the form 
y' = a + bt + ct² + dt³ + et⁴, the fitting process is carried out with 
respect to the form 

\[
y' = A + Bp_1 + Cp_2 + Dp_3 + Ep_4
\]

in which p₁, p₂, p₃, and p₄ are polynomials in t of the first, second, 
third, and fourth degree, respectively, that are orthogonal to 
each other and to unity, that is to say, where p₁ is a polynomial 
in t of the form p₁ = k₁₀ + t, p₂ is a polynomial in t of the form 
p₂ = k₂₀ + k₂₁t + t², p₃ is a polynomial in t of the form 

\[
p_3 = k_{30} + k_{31}t + k_{32}t^2 + t^3, \text{ etc.}
\]

and where Σp₁ = 0, Σp₂ = 0, Σp₃ = 0, Σp₄ = 0, and Σp₁p₂ = 0, 
Σp₁p₃ = 0, Σp₁p₄ = 0, Σp₂p₃ = 0, etc. With reference to the 
arithmetical illustration given above, which was a second-degree 
polynomial, this is equivalent to deriving a trend line of the form 

\[
y' = 115 + 6(t - 1.5) - 0.7(t^2 - 3t + 1)
\]

instead of the usual form y' = 105.3 + 8.1t − 0.7t². 

Either method will, of course, give the same result; for, 
whichever form is derived, it can be converted into the other by 
simple algebra. It is the purpose of this section to show the 
simplification gained by using the orthogonal-polynomial form 
rather than the usual form. The problem of finding the forms 
of the polynomials themselves, i.e., the values of the k coefficients, 
will be left for a subsequent section. 
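If a trend line is put in the orthogonal-polynomial form 

\[
y' = A + Bp_1 + Cp_2 + Dp_3
\]

and is then fitted by the method of least squares, i.e., if A, B, C, 
and D are determined so that 

\[
\Sigma (y - y')^2 = \Sigma (y - A - Bp_1 - Cp_2 - Dp_3)^2
\]

is made a minimum, the following conditions are obtained: 

\[
\begin{aligned}
\Sigma (y - A - Bp_1 - Cp_2 - Dp_3) &= 0 \\
\Sigma p_1 (y - A - Bp_1 - Cp_2 - Dp_3) &= 0 \\
\Sigma p_2 (y - A - Bp_1 - Cp_2 - Dp_3) &= 0 \\
\Sigma p_3 (y - A - Bp_1 - Cp_2 - Dp_3) &= 0
\end{aligned}
\]

or 

\[
\begin{aligned}
\Sigma y &= NA + B\Sigma p_1 + C\Sigma p_2 + D\Sigma p_3 \\
\Sigma p_1 y &= A\Sigma p_1 + B\Sigma p_1^2 + C\Sigma p_1 p_2 + D\Sigma p_1 p_3 \\
\Sigma p_2 y &= A\Sigma p_2 + B\Sigma p_1 p_2 + C\Sigma p_2^2 + D\Sigma p_2 p_3 \\
\Sigma p_3 y &= A\Sigma p_3 + B\Sigma p_1 p_3 + C\Sigma p_2 p_3 + D\Sigma p_3^2
\end{aligned}
\]

But since 1, p₁, p₂, and p₃ form an orthogonal set (by 
assumption), it follows that Σp₁ = 0, Σp₂ = 0, Σp₃ = 0, Σp₁p₂ = 0, 
Σp₁p₃ = 0, and Σp₂p₃ = 0. Hence the above equations reduce 
to 

\[
\Sigma y = NA \qquad \Sigma p_1 y = B\Sigma p_1^2 \qquad
\Sigma p_2 y = C\Sigma p_2^2 \qquad \Sigma p_3 y = D\Sigma p_3^2
\]

and therefore 

\[
A = \frac{\Sigma y}{N} \qquad B = \frac{\Sigma p_1 y}{\Sigma p_1^2} \qquad
C = \frac{\Sigma p_2 y}{\Sigma p_2^2} \qquad D = \frac{\Sigma p_3 y}{\Sigma p_3^2}
\tag{2}
\]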


The simple form of these solutions will be noted. It will also 
be noted that the solution for A is independent of p₁, p₂, and 
p₃ and that the solution for B depends only upon p₁, the solution 
for C only upon p₂, and the solution for D only upon p₃. This 
means that the value of A would have been the same whether a 
first-, second-, or third-degree trend line had been fitted. 
Similarly, the value of B would have been the same whether a first-, 
second-, or third-degree trend had been fitted, and the value of C 
would have been the same whether a second- or third-degree 
trend line had been fitted. For if y' = A + Bp₁ had been fitted, 
the solutions would still have been 

\[
A = \frac{\Sigma y}{N} \quad \text{and} \quad B = \frac{\Sigma p_1 y}{\Sigma p_1^2}
\]

If y' = A + Bp₁ + Cp₂ had been fitted, the solutions would 
still have been 

\[
A = \frac{\Sigma y}{N}, \quad B = \frac{\Sigma p_1 y}{\Sigma p_1^2}, \quad \text{and} \quad C = \frac{\Sigma p_2 y}{\Sigma p_2^2}
\]

The addition of the term Cp₂ does not therefore change the 
values obtained for A or B, and the addition of the term Dp₃ 
does not change the values obtained for A, B, or C. It also 
can be seen that if a fifth term were added to the trend line, 
namely, Ep₄, making it a fourth-degree trend, the value of E 
would be given by E = Σp₄y/Σp₄² and the values of A, B, C, 
and D would be the same as before. It is this simplicity and 
independence of the solutions of the least-squares equations 
when orthogonal polynomials are used that give the 
orthogonal-polynomial method its main advantage over the ordinary 
method. 
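The independence of the solutions can be seen in a small numerical sketch. The y values below are not observed data; they are generated, purely for illustration, from the equation y' = 105.3 + 8.1t − 0.7t² used earlier in this chapter, and the name `coefficient` is an assumption introduced here.

```python
ts = [0, 1, 2, 3]
y = [105.3 + 8.1 * t - 0.7 * t**2 for t in ts]   # illustrative values only

p1 = [t - 1.5 for t in ts]
p2 = [t**2 - 3 * t + 1 for t in ts]


def coefficient(p, y):
    """Least-squares coefficient of an orthogonal polynomial: sum(py)/sum(p^2)."""
    return sum(pi * yi for pi, yi in zip(p, y)) / sum(pi * pi for pi in p)


A = sum(y) / len(y)
B = coefficient(p1, y)
C = coefficient(p2, y)

# A and B are computed without any reference to p2, so a straight-line fit
# (A, B) and a second-degree fit (A, B, C) share the same first two values.
print(round(A, 4), round(B, 4), round(C, 4))   # 115.0 6.0 -0.7
```

The result reproduces the form y' = 115 + 6(t − 1.5) − 0.7(t² − 3t + 1) given above.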

Forms of Orthogonal Polynomials Used. The forms of the 
orthogonal polynomials to be used for fitting trends can be 
generalized; what is required is to find the k's in terms of the 
given values of t and the number of years involved. The 
condition has been laid down that p₁ = k₁₀ + t, p₂ = k₂₀ + k₂₁t + t², 
and p₃ = k₃₀ + k₃₁t + k₃₂t² + t³, etc., are to be polynomials of 
the first, second, and third degree in t, respectively, that are 
orthogonal to each other and to unity. The problem is to make 
use of this condition to determine values for the k's in terms of 
the given values of t. When this is done, it will be possible to 
find the actual values of A, B, C, and D, from the formulas of the 
preceding section. 
By way of illustration, the forms of only p₁ and p₂ will be 
derived; the method can be readily extended to the determination 
of the forms of p₃ and of higher polynomials. 

First, it is assumed that the time intervals T are measured 
from the mean T̄, so that p₁ and p₂ become p₁ = k₁₀ + t and 
p₂ = k₂₀ + k₂₁t + t², where t = T − T̄. In addition, it is 
supposed that the time intervals to which the variable refers are 
equally spaced and without interruptions. According to these 
assumptions, t will have a mean of zero; its highest value will 
be +(N − 1)/2 and its lowest value −(N − 1)/2.* For example, if 
there are 5 years of data, the middle year will be 0, the first 
year −2, and the last year +2. If there are 4 years of data, the 
first year will be −1½, the second year −½, the third year +½, 
and the last +1½. 

Accordingly, all the odd moments of t, such as Σt/N, Σt³/N, 
and Σt⁵/N, will be zero; the even moments, such as Σt²/N, 
Σt⁴/N, and Σt⁶/N, are computable from simple formulas 
depending entirely on N, the number of years, as already noted in 
Chap. XXI.¹ 

With these assumptions, the derivation of the form of the 
orthogonal polynomials, that is to say, the derivation of the 
values of k₁₀, k₂₀, and k₂₁, may now be undertaken. The 
condition that p₁, p₂, and 1 shall be orthogonal to each other requires 
that Σp₁ = 0, Σp₂ = 0, and Σp₁p₂ = 0. These equations may 
be written as follows: 

\[
\begin{aligned}
\Sigma p_1 &= \Sigma (k_{10} + t) = Nk_{10} + \Sigma t = 0 &\quad& \text{(i)} \\
\Sigma p_2 &= \Sigma (k_{20} + k_{21}t + t^2) = Nk_{20} + k_{21}\Sigma t + \Sigma t^2 = 0 && \text{(ii)} \\
\Sigma p_1 p_2 &= \Sigma p_2 (k_{10} + t) = k_{10}\Sigma p_2 + \Sigma p_2 t = 0 && \text{(iii)}
\end{aligned}
\]

From these equations, the values of the k's can readily be 
obtained. Since Σt = 0, Eq. (i) gives Nk₁₀ = 0, or k₁₀ = 0; and 
Eq. (ii) gives Nk₂₀ + Σt² = 0, or k₂₀ = −Σt²/N. From Eq. (ii), 
it is known that Σp₂ = 0; hence, Eq. (iii) becomes Σp₂t = 0. 
Substituting the equivalent of p₂, this gives the condition, 

\[
\Sigma p_2 t = \Sigma (k_{20} + k_{21}t + t^2)\,t = k_{20}\Sigma t + k_{21}\Sigma t^2 + \Sigma t^3 = 0 \qquad \text{(iv)}
\]

* These assumptions were made in the preceding chapter. Cf. Tables 
84 to 86, Chap. XXI. 
¹ Cf. pp. 584, 586. 
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Since both Zt and Zi? are equal to zero, this becomes 
katt? = 0 


and hence ko must be zero, since Z/? is not. The values of the 
k’s, therefore, are as follows: 


ku =0 
ko = 0 
21 Si (3) 
ko = — — 
20 N 
and the forms of the polynomials pı, ps are therefore 
p-t 
ze? 4 
pp. 4) 
for 
PË = n(n — 1)(2n = 1) 
3 
in which 
n= SE 
RK Ze 
Hence, 
Bt Nd 
IND 12 
Aecordingly, 
Die AER 
che 12 


Similar methods of analysis may be used to derive the forms
of $p_3$ and higher polynomials. The results obtained for
polynomials up to the fifth degree may be listed as follows:¹

\[
\begin{aligned}
p_1 &= t\\
p_2 &= t^2 - \frac{N^2 - 1}{12}\\
p_3 &= t^3 - \frac{3N^2 - 7}{20}\,t\\
p_4 &= t^4 - \frac{3N^2 - 13}{14}\,t^2 + \frac{3(N^2 - 1)(N^2 - 9)}{560}\\
p_5 &= t^5 - \frac{5(N^2 - 7)}{18}\,t^3 + \frac{15N^4 - 230N^2 + 407}{1{,}008}\,t
\end{aligned}
\tag{5}
\]

* Cf. Eqs. (10), Chap. XXI.

1 Cf. FISHER, R. A., Statistical Methods for Research Workers, Section 27.

Thus, it is to be noted that a trend line can be fitted in two
different forms, by the method of least squares; it can be fitted
in the form $y' = a + bt + ct^2 + dt^3 + \cdots$ (where $t = T - \bar{T}$)
by the methods described in the preceding chapter, or it can be
fitted in the form $y' = A + Bp_1 + Cp_2 + Dp_3 + \cdots$ by the
method of orthogonal polynomials. If the orthogonal-polynomial
form is used in the fitting process, the ordinary form of the
trend equation can readily be derived from the results; it should
be repeated that the criterion of fit in each case is the
least-squares criterion.
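
As a check on the listed forms, the short sketch below (an illustration added here, with hypothetical function names) evaluates the five polynomials of Eqs. (5) at the centered time values and confirms, in exact arithmetic, that each sums to zero and that every pair has a zero sum of products. It assumes equally spaced data with t measured from the mean, as stipulated above.

```python
from fractions import Fraction
from itertools import combinations


def centered_times(n):
    """t = T - mean(T) for n equally spaced periods."""
    return [Fraction(2 * i - n - 1, 2) for i in range(1, n + 1)]


def orthogonal_polynomials(t, n):
    """Values of p1..p5 from Eqs. (5) at one value of t, for a series of n terms."""
    n2 = n * n
    p1 = t
    p2 = t ** 2 - Fraction(n2 - 1, 12)
    p3 = t ** 3 - Fraction(3 * n2 - 7, 20) * t
    p4 = (t ** 4 - Fraction(3 * n2 - 13, 14) * t ** 2
          + Fraction(3 * (n2 - 1) * (n2 - 9), 560))
    p5 = (t ** 5 - Fraction(5 * (n2 - 7), 18) * t ** 3
          + Fraction(15 * n2 * n2 - 230 * n2 + 407, 1008) * t)
    return [p1, p2, p3, p4, p5]


def check_orthogonality(n):
    """True if p1..p5 are orthogonal to unity and to one another over the n periods."""
    rows = [orthogonal_polynomials(t, n) for t in centered_times(n)]
    columns = list(zip(*rows))  # one column of values per polynomial
    orthogonal_to_unity = all(sum(column) == 0 for column in columns)
    mutually_orthogonal = all(
        sum(a * b for a, b in zip(columns[i], columns[j])) == 0
        for i, j in combinations(range(5), 2)
    )
    return orthogonal_to_unity and mutually_orthogonal


if __name__ == "__main__":
    print(all(check_orthogonality(n) for n in range(2, 30)))
```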

Calculation of the Coefficients A, B, C, .... If the values of
$p_1$, $p_2$, $p_3$, ... given in Eqs. (5) are substituted in the formulas
for A, B, C, etc. [Eqs. (2)], the following values are
obtained:

\[
\begin{aligned}
A &= \frac{\sum y}{N}\\
B &= \frac{12}{N(N^2 - 1)} \sum ty\\
C &= \frac{180}{N(N^2 - 1)(N^2 - 4)} \left( \sum t^2 y - \frac{N^2 - 1}{12} \sum y \right)\\
D &= \frac{2{,}800}{N(N^2 - 1)(N^2 - 4)(N^2 - 9)} \left( \sum t^3 y - \frac{3N^2 - 7}{20} \sum ty \right)\\
E &= \frac{44{,}100}{N(N^2 - 1)(N^2 - 4)(N^2 - 9)(N^2 - 16)} \left( \sum t^4 y - \frac{3N^2 - 13}{14} \sum t^2 y + \frac{3(N^2 - 1)(N^2 - 9)}{560} \sum y \right)\\
F &= \frac{698{,}544}{N(N^2 - 1)(N^2 - 4)(N^2 - 9)(N^2 - 16)(N^2 - 25)} \left( \sum t^5 y - \frac{5(N^2 - 7)}{18} \sum t^3 y + \frac{15N^4 - 230N^2 + 407}{1{,}008} \sum ty \right)
\end{aligned}
\tag{6}
\]

In order to illustrate the algebraic procedure by which the
above formulas are obtained, the formula for C will be derived,
as follows:

\[ C = \frac{\sum p_2 y}{\sum p_2^2} \]

But

\[ p_2 = t^2 - \frac{N^2 - 1}{12} \]

Hence,

\[
\begin{aligned}
\sum p_2 y &= \sum t^2 y - \frac{N^2 - 1}{12} \sum y\\
\sum p_2^2 &= \sum t^4 - \frac{2(N^2 - 1)}{12} \sum t^2 + \frac{(N^2 - 1)^2}{144}\,N
\end{aligned}
\]

The formula for $\sum t^2/N$, however, is $\frac{N^2 - 1}{12}$, and hence

\[ \sum t^2 = \frac{N(N^2 - 1)}{12} \]

Likewise, the formula for $\sum t^4/N$ is $\frac{N^2 - 1}{12} \cdot \frac{3N^2 - 7}{20}$, and hence

\[ \sum t^4 = \frac{N(N^2 - 1)}{12} \cdot \frac{3N^2 - 7}{20} \]

Therefore the denominator of C becomes

\[
\frac{N(N^2 - 1)}{12} \cdot \frac{3N^2 - 7}{20} - \frac{2(N^2 - 1)}{12} \cdot \frac{N(N^2 - 1)}{12} + \frac{N(N^2 - 1)^2}{144}
\]

Taking $N(N^2 - 1)/12$ out of each term,

\[
\frac{N(N^2 - 1)}{12} \left[ \frac{3N^2 - 7}{20} - \frac{2(N^2 - 1)}{12} + \frac{N^2 - 1}{12} \right]
\]

which readily reduces to

\[
\frac{N(N^2 - 1)}{12} \cdot \frac{N^2 - 4}{15} = \frac{N(N^2 - 1)(N^2 - 4)}{180}
\]

Thus C has the formula given above. The formulas for the
other coefficients can be obtained in the same way.

Equations (6) could be applied by using a work sheet with
columns for the product terms indicated in order to obtain
$\sum ty$, $\sum t^2 y$, $\sum t^3 y$, etc. Greater economy is obtained, however, by
using the subtotal summation type of work sheet illustrated in
the preceding chapter. By using such a work sheet, an
expeditious method that involves only addition and is self-checking
has been evolved for finding A, B, C, .... A brief description
of this method, together with the mathematical analysis that
justifies its use, will now be given.
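
Before the shortcut is taken up, the direct use of Eqs. (6) with product-term columns can be sketched as follows. This is an illustrative sketch only: the function names and the sample series are assumptions introduced here, not figures from the text. It computes C from the closed formula in Eqs. (6) and compares it with the defining ratio Σp₂y / Σp₂², and it checks the denominator N(N² − 1)(N² − 4)/180 derived above.

```python
from fractions import Fraction


def centered_times(n):
    """t = T - mean(T) for n equally spaced periods."""
    return [Fraction(2 * i - n - 1, 2) for i in range(1, n + 1)]


def coefficient_c_from_eq6(y):
    """C from Eqs. (6): 180 / [N(N^2-1)(N^2-4)] * (sum t^2 y - (N^2-1)/12 * sum y)."""
    n = len(y)
    t = centered_times(n)
    sum_y = sum(y)
    sum_t2y = sum(ti ** 2 * yi for ti, yi in zip(t, y))
    factor = Fraction(180, n * (n * n - 1) * (n * n - 4))
    return factor * (sum_t2y - Fraction(n * n - 1, 12) * sum_y)


def coefficient_c_from_definition(y):
    """C as the ratio sum(p2 * y) / sum(p2 ** 2), with p2 = t^2 - (N^2 - 1)/12."""
    n = len(y)
    p2 = [ti ** 2 - Fraction(n * n - 1, 12) for ti in centered_times(n)]
    return sum(p * yi for p, yi in zip(p2, y)) / sum(p * p for p in p2)


def sum_p2_squared(n):
    """Denominator of C computed term by term, for comparison with the closed form."""
    return sum((ti ** 2 - Fraction(n * n - 1, 12)) ** 2 for ti in centered_times(n))


if __name__ == "__main__":
    sample = [12, 15, 19, 26, 30, 41, 55]  # hypothetical annual figures
    print(coefficient_c_from_eq6(sample), coefficient_c_from_definition(sample))
```

The formula needs N of at least 3, since the factor N² − 4 in the denominator vanishes for N = 2.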

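$\alpha$ is defined as $\sum y / N$, so that

\[ N\alpha = S_1 \tag{7} \]

and $\alpha'$ is defined as equal to $\alpha$. Accordingly,

\[ A = \alpha' = \alpha \tag{i} \]

From Eqs. (14), Chap. XXI,

\[ S_2 = \frac{N + 1}{2} \sum y - \sum ty \]

and, by the definition of $\alpha$,

\[ S_2 = \frac{N(N + 1)}{2}\,\alpha - \sum ty \]

If $\beta$ is now defined as

\[ \beta = \frac{2}{N(N + 1)}\,S_2 \tag{ii} \]

then

\[ \beta = \alpha - \frac{2}{N(N + 1)} \sum ty \]

But $\sum ty = \sum p_1 y$, since $p_1 = t$; and if $\beta'$ is defined as
$\beta' = 2\sum p_1 y / N(N + 1)$,

\[ \beta = \alpha - \beta' \]

and

\[ \beta' = \alpha - \beta \tag{iii} \]

Since $B = \dfrac{12 \sum p_1 y}{N(N^2 - 1)}$, it follows that

\[ B = \frac{6}{N - 1}\,\beta' \tag{iv} \]

Again, from Eqs. (14), Chap. XXI, it is found that

\[
\begin{aligned}
2S_3 &= \sum \left( \frac{N + 1}{2} - t \right)^2 y + \sum \left( \frac{N + 1}{2} - t \right) y\\
&= \frac{(N + 1)(N + 3)}{4} \sum y - (N + 2) \sum ty + \sum t^2 y
\end{aligned}
\]

in which $\alpha$ and $\beta$ may be substituted for equivalents, so that

\[
2S_3 = \frac{N(N + 1)(N + 3)}{4}\,\alpha - \frac{N(N + 1)(N + 2)}{2}\,(\alpha - \beta) + \sum t^2 y
\]

But

\[ \sum p_2 y = \sum \left( t^2 - \frac{N^2 - 1}{12} \right) y = \sum t^2 y - \frac{N^2 - 1}{12} \sum y \]

Hence

\[ \sum t^2 y = \frac{N(N^2 - 1)}{12}\,\alpha + \sum p_2 y \]

Therefore, making substitutions in the above value of $2S_3$,

\[
\begin{aligned}
2S_3 &= \frac{3N(N + 1)(N + 3) - 6N(N + 1)(N + 2) + N(N + 1)(N - 1)}{12}\,\alpha + \frac{N(N + 1)(N + 2)}{2}\,\beta + \sum p_2 y\\
&= -\frac{N(N + 1)(N + 2)}{6}\,\alpha + \frac{N(N + 1)(N + 2)}{2}\,\beta + \sum p_2 y
\end{aligned}
\]

Now, if $\gamma$ is defined as

\[ \gamma = \frac{6}{N(N + 1)(N + 2)}\,S_3 \tag{v} \]

then

\[
\frac{2N(N + 1)(N + 2)}{6}\,\gamma = -\frac{N(N + 1)(N + 2)}{6}\,\alpha + \frac{3N(N + 1)(N + 2)}{6}\,\beta + \sum p_2 y
\]

And if $\gamma'$ is defined as $\gamma' = \dfrac{6}{N(N + 1)(N + 2)} \sum p_2 y$,
then

\[ 2\gamma = -\alpha + 3\beta + \gamma' \]

and

\[ \gamma' = \alpha - 3\beta + 2\gamma \tag{vi} \]

and since $C = \dfrac{180}{N(N^2 - 1)(N^2 - 4)} \sum p_2 y$, it follows that

\[ C = \frac{30}{(N - 1)(N - 2)}\,\gamma' \tag{vii} \]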
In the same manner, it can be shown that if

\[ \delta = \frac{24}{N(N + 1)(N + 2)(N + 3)}\,S_4 \tag{viii} \]

and

\[ \delta' = \frac{20}{N(N + 1)(N + 2)(N + 3)} \sum p_3 y \]

then

\[ \delta' = \alpha - 6\beta + 10\gamma - 5\delta \tag{ix} \]

and

\[ D = \frac{140}{(N - 1)(N - 2)(N - 3)}\,\delta' \tag{x} \]

As a result of the above analysis, from Eqs. (i), (ii), (v), and
(viii), the following formulas are obtained:

\[
\begin{aligned}
\alpha &= \frac{1}{N}\,S_1\\
\beta &= \frac{2}{N(N + 1)}\,S_2\\
\gamma &= \frac{6}{N(N + 1)(N + 2)}\,S_3\\
\delta &= \frac{24}{N(N + 1)(N + 2)(N + 3)}\,S_4\\
\epsilon &= \frac{120}{N(N + 1)(N + 2)(N + 3)(N + 4)}\,S_5\\
\zeta &= \frac{720}{N(N + 1)(N + 2)(N + 3)(N + 4)(N + 5)}\,S_6
\end{aligned}
\tag{8}
\]

The values of $\epsilon$ and of $\zeta$ are indicated by extension, since the
symmetrical pattern of these formulas is readily apparent.
The numerators run 1!, 2!, 3!, 4!, 5!, 6!, etc., and the
denominators run N, N(N + 1), N(N + 1)(N + 2),
N(N + 1)(N + 2)(N + 3), etc.

Similarly, from Eqs. (i), (iii), (vi), and (ix), the following
formulas are obtained:¹

\[
\begin{aligned}
\alpha' &= \alpha\\
\beta' &= \alpha - \beta\\
\gamma' &= \alpha - 3\beta + 2\gamma\\
\delta' &= \alpha - 6\beta + 10\gamma - 5\delta\\
\epsilon' &= \alpha - 10\beta + 30\gamma - 35\delta + 14\epsilon\\
\zeta' &= \alpha - 15\beta + 70\gamma - 140\delta + 126\epsilon - 42\zeta
\end{aligned}
\tag{9}
\]

and from Eqs. (i), (iv), (vii), and (x), the following formulas are
obtained:

1 For additional equations, see Fisher, op. cit., or George W. Snedecor,
Statistical Methods (1940), pp. 324-334, where the procedure is applied to
problems of curvilinear correlation in which probability interpretation is
valid.

\[
\begin{aligned}
A &= \alpha'\\
B &= \frac{6}{N - 1}\,\beta'\\
C &= \frac{30}{(N - 1)(N - 2)}\,\gamma'\\
D &= \frac{140}{(N - 1)(N - 2)(N - 3)}\,\delta'\\
E &= \frac{630}{(N - 1)(N - 2)(N - 3)(N - 4)}\,\epsilon'\\
F &= \frac{2{,}772}{(N - 1)(N - 2)(N - 3)(N - 4)(N - 5)}\,\zeta'
\end{aligned}
\tag{10}
\]

Tables to Be Used in Orthogonal-polynomial Analysis to Save
Calculations. All the explanation necessary for the application
of the method of orthogonal polynomials to a problem has been
given. Thus, from a work sheet providing the series of sums
$S_1$, $S_2$, $S_3$, ..., Eqs. (8) could be used to find the series $\alpha$, $\beta$,
$\gamma$, $\delta$, ...; from these, Eqs. (9) could be used to find the series
$\alpha'$, $\beta'$, $\gamma'$, $\delta'$, ...; from these, Eqs. (10) could be used to find
the series A, B, C, .... The set of orthogonal polynomials
fitting the data according to the least-squares criterion could
then be written $y' = A + Bp_1 + Cp_2 + Dp_3 + \cdots$. From
Eqs. (5), values of $p_1$, $p_2$, $p_3$ in terms of t could then be
substituted, and the final equation of trend in terms of t would be
found. But it is desirable to effect another economy, by use of
three tables of values that are the same for all problems having
the same number of years of data.
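
The chain of calculations just summarized can be sketched in a few lines. The sketch below is illustrative, not the authors' work sheet. It assumes that the sums S₁, S₂, S₃, ... are formed by repeated running summation of the data from the first year downward, each S being the last entry of its column (Eqs. (14) of Chap. XXI are not reproduced in this section, so that layout is an assumption). Function names and the sample series are hypothetical. The result is compared with the defining ratios Σpₖy / Σpₖ².

```python
from fractions import Fraction
from itertools import accumulate


def successive_sums(y, how_many):
    """S1, S2, ...: final entry of each repeated running-sum column (assumed layout)."""
    column = list(y)
    totals = []
    for _ in range(how_many):
        column = list(accumulate(column))  # only additions are needed
        totals.append(column[-1])
    return totals


def coefficients_by_summation(y):
    """A, B, C, D through Eqs. (8), (9), and (10); needs at least 4 observations."""
    n = len(y)
    s1, s2, s3, s4 = (Fraction(s) for s in successive_sums(y, 4))
    # Eqs. (8)
    alpha = s1 / n
    beta = 2 * s2 / (n * (n + 1))
    gamma = 6 * s3 / (n * (n + 1) * (n + 2))
    delta = 24 * s4 / (n * (n + 1) * (n + 2) * (n + 3))
    # Eqs. (9)
    alpha_p = alpha
    beta_p = alpha - beta
    gamma_p = alpha - 3 * beta + 2 * gamma
    delta_p = alpha - 6 * beta + 10 * gamma - 5 * delta
    # Eqs. (10)
    return (alpha_p,
            6 * beta_p / (n - 1),
            30 * gamma_p / ((n - 1) * (n - 2)),
            140 * delta_p / ((n - 1) * (n - 2) * (n - 3)))


def coefficients_directly(y):
    """A, B, C, D as sum(p_k * y) / sum(p_k ** 2), with p_k from Eqs. (5)."""
    n = len(y)
    t = [Fraction(2 * i - n - 1, 2) for i in range(1, n + 1)]
    columns = [
        [Fraction(1)] * n,
        t,
        [ti ** 2 - Fraction(n * n - 1, 12) for ti in t],
        [ti ** 3 - Fraction(3 * n * n - 7, 20) * ti for ti in t],
    ]
    return tuple(sum(p * yi for p, yi in zip(col, y)) / sum(p * p for p in col)
                 for col in columns)


if __name__ == "__main__":
    sample = [12, 15, 19, 26, 30, 41, 55]  # hypothetical annual figures
    print(coefficients_by_summation(sample) == coefficients_directly(sample))
```

Under the assumed layout the two routes agree exactly, which is the self-checking property the text refers to: an arithmetic slip in a running-sum column shows up as a disagreement.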

Thus, the use of Eqs. (8) will be greatly facilitated by the use
of Table 93, a set of constants, $\frac{N(N + 1)}{2}$, $\frac{N(N + 1)(N + 2)}{6}$,
$\frac{N(N + 1)(N + 2)(N + 3)}{24}$, etc., worked out for various odd values of N,
that is to say, for various numbers of years, from 11 to 41. The
use of Eqs. (10) will be greatly facilitated by referring to
Table 94 for the various values of the series of constants
$\frac{N - 1}{6}$, $\frac{(N - 1)(N - 2)}{30}$, $\frac{(N - 1)(N - 2)(N - 3)}{140}$, etc. And the
use of Eqs. (5) will be made easier by referring to Table 95 for
the values of $\frac{N^2 - 1}{12}$, $\frac{3N^2 - 7}{20}$, $\frac{3N^2 - 13}{14}$, etc.


TABLE 93 (fragment: rows for N = 33 to 41)

 N    N(N+1)/2    N(N+1)(N+2)/6    N(N+1)(N+2)(N+3)/24    N(N+1)...(N+4)/120    N(N+1)...(N+5)/720
33       561           6,545              58,905                 435,897              2,760,681
35       630           7,770              73,815                 575,757              3,838,380
37       703           9,139              91,390                 749,398              5,245,786
39       780          10,660             111,930                 962,598              7,059,052
41       861          12,341             135,751               1,221,759              9,366,819

Advantages of Method of Orthogonal Polynomials. The method
of orthogonal polynomials is a great timesaver whenever a trend
of higher order than a second-degree polynomial is fitted. While
it has required several pages to describe the method, it will be
noted that the actual solution of a problem requires little more
than a page of figures besides the work sheet. This is
illustrated in Chap. XXIV.

But the saving of time is not the sole advantage of the method
of orthogonal polynomials. In addition, the set of orthogonal
polynomials that is obtained when values for A, B, C, D, ...,
are obtained, that is to say,

\[ y' = A + Bp_1 + Cp_2 + Dp_3 + \cdots \]

constitutes the solution for any one of several trend lines.
Thus $y' = A + Bp_1$ is the straight-line trend; the addition of
$Cp_2$ gives the second-degree polynomial trend; the addition of
$Dp_3$ gives the third-degree polynomial trend, etc. It is not
necessary to recalculate values for A, B, C, ..., for the various
trends required. If a problem has been worked out to include
solutions for A, B, C, and D and subsequently it is decided that
E is required, it can be found by adding one more column to
the work sheet and finding the value of E without recalculating
the values of A, B, C, and D.

This convenience of obtaining several types of trends from
one orthogonal set comes from the fact that the terms of the
orthogonal equation are linearly uncorrelated with each other.¹


1 See p. 600.
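
The property that lower-order coefficients survive the addition of a higher term can be illustrated with a small sketch. The code below is an added illustration with hypothetical names and data. It fits orthogonal forms of degree 1, 2, and 3 to one series and converts the second-degree result to the ordinary form $a + bt + ct^2$ by substituting $p_2 = t^2 - (N^2 - 1)/12$.

```python
from fractions import Fraction


def orthogonal_columns(n):
    """Columns of 1, p1, p2, p3 (Eqs. (5)) over the n centered time values."""
    t = [Fraction(2 * i - n - 1, 2) for i in range(1, n + 1)]
    return [
        [Fraction(1)] * n,
        t,
        [ti ** 2 - Fraction(n * n - 1, 12) for ti in t],
        [ti ** 3 - Fraction(3 * n * n - 7, 20) * ti for ti in t],
    ]


def fit(y, degree):
    """Coefficients A, B, C, ... of the orthogonal form up to `degree` (at most 3)."""
    columns = orthogonal_columns(len(y))[: degree + 1]
    return [sum(p * yi for p, yi in zip(col, y)) / sum(p * p for p in col)
            for col in columns]


def trend_values(coefficients, n):
    """Trend ordinates y' = A + B*p1 + C*p2 + ... for each of the n periods."""
    columns = orthogonal_columns(n)
    return [sum(c * col[i] for c, col in zip(coefficients, columns)) for i in range(n)]


def ordinary_quadratic(coefficients, n):
    """Convert A + B*p1 + C*p2 into (a, b, c) of a + b*t + c*t**2."""
    A, B, C = coefficients
    return A - C * Fraction(n * n - 1, 12), B, C


if __name__ == "__main__":
    sample = [12, 15, 19, 26, 30, 41, 55, 60, 58]  # hypothetical annual figures
    for degree in (1, 2, 3):
        print(degree, [str(c) for c in fit(sample, degree)])
    print([str(v) for v in ordinary_quadratic(fit(sample, 2), len(sample))])
```

In the printed output A and B are the same at every degree, whereas the constant term of the ordinary form shifts from A to A − C(N² − 1)/12 once the second-degree term is admitted. That shift is why the ordinary form must be re-solved for each degree and the orthogonal form need not be.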


CHAPTER XXIII 
TIME-SERIES ANALYSIS—SEASONAL VARIATION 


Historical Background. The second major stimulus to the
development of methods for analyzing time series, listed at the
beginning of Chap. XX, was the troublesome effects of seasonal
variations in economic activity. Writers on labor problems
stress the evil effects for labor of wide seasonal fluctuations in
some employments. The effects of seasonal variations upon the
banking and credit system were emphasized during the nineteenth
century and the early part of the twentieth century. Even as
early as 1793, Alexander Hamilton advised that redemption of
the public debt be carried on during the winter, for, said he,
"it is a familiar fact that during the winter in this country, there
is always a scarcity of money in the towns—a circumstance
calculated to damp the price of stock."¹

Jevons made an analysis of the effects of the "autumnal
pressure" on the London money market and calculated the average
monthly fluctuations in currency movement between the Bank
of England and its branches (1855-1862) and the average
monthly excess of payments or receipts of British coin at the
Bank of England for the same period.² In 1890, George Clare
analyzed the seasonal variations for the period from 1881 to
1890 in the circulation of the Bank of England, in public deposits,
in "other deposits," in "other securities," in the "reserve," and
in the "internal gold movements."³ In 1902, J. P. Norton
published a study of the New York money market in which he computed

1 28th Congress, 1st Session, Executive Document, 15, p. 199. Cf. MYERS,
MARGARET G., The New York Money Market, Vol. 1, Origins and
Development, p. 208. Other early references to seasonal fluctuations are Hunt's
Merchants' Magazine, Vol. 20, p. 302, Vol. 39, p. 582; Journal of Commerce,
Aug. 3, 1846.

2 Investigations in Currency and Finance (Foxwell ed., 1909), pp. 158-159.
Cf. MITCHELL, W. C., Business Cycles—The Problem Stated and Its Setting
(1928), pp. 199, 236.

3 A Money-Market Primer (2d ed.), pp. 19, 24, 31, 42, 53, 55. Cf. SMITH,
JAMES G., BENJAMIN H. BECKHART, and WILLIAM A. BROWN, The New York
Money Market, Vol. 4, External and Internal Relations, p. 424.

the seasonal variation in loans, the peaks occurring on
Mar. 4 and in July and December and the low points occurring
at the beginning of the year, in May, and at the end of
November.¹ The outstanding statistical analysis of seasonal variations
in the New York money market before the First World War is
that prepared for the National Monetary Commission in 1910
by Prof. E. W. Kemmerer.² In this study he analyzed seasonal
variations in money rates, exchange rates, bond yields, currency
movements, and deposits. His analysis brought out the
seasonal relationships in a striking manner, in spite of very strict
limitations in available data at the time. Much of his work is
based upon data gathered by the questionnaire method.

Causes of Seasonal Variation. Two types of underlying forces
cause seasonal variations in economic activity: (1) climatic
conditions giving rise to seasons in agricultural production, in
out-of-door construction work, in the manufacture of clothing, in
the use of fuel, and in traveling, etc., and (2) forces arising from
convention, such as the Christmas and Easter trade and
seasonal style convention.³ The effects of these various basic
seasonal influences upon the New York money market and upon
the banking and credit structure of the United States have
recently been exhaustively studied and published in Vol. 4 of the
previously mentioned studies of The New York Money Market,
edited by Prof. Benjamin H. Beckhart of Columbia University.⁴

In large part the movement for banking reform in this country,
which culminated in the studies of the National Monetary
Commission and the Federal Reserve Act of 1913, was the result of
the evil effects of seasonal fluctuations in the demands of trade
giving rise to periodical stringencies in the money market and
frequently initiating monetary panics. Consequently, it was one of
the most important aims of the Federal Reserve System to devise
an elastic currency and credit system that would accommodate
these seasonal demands.⁵ Thus banking reform in the United

1 Statistical Studies in the New York Money Market, pp. 62-64.

2 Seasonal Variations in the Relative Demand for Money and Capital in
the United States (National Monetary Commission Publications), Vol. 22.

3 Cf. MITCHELL, op. cit., pp. 236-240.

4 The New York Money Market, Vol. 4, External and Internal Relations,
pp. 417-542.

5 The New York Money Market, Vol. 2, Sources and Movements of Funds,
pp. 155-374.

States is a case in which a long-recognized evil was finally
statistically measured and evaluated and a reform in the system definitely
resulted in improvement.

Not only in the field of banking has the study of seasonal
variation by statistical methods been stimulated. In addition,
unemployment with all its economic, social, and psychological
implications has aroused great concern about the measurement of
such variation. Extended reference to the problem of seasonal
unemployment was made at former President Hoover's
Conference on Unemployment, in the Report and Recommendations
of the Committee to Investigate Business Cycles and
Unemployment.¹

In the hearings before the Committee on Education and
Labor, of the United States Senate, in 1928-1929, much material
and discussion are devoted to the subject of the seasonal
variations in employment in industries and trade.² Franklin D.
Roosevelt, when governor of New York State, appointed a
Committee on the Stabilization of Industry for the Prevention of
Unemployment, which made its report to him in November,
1930, entitled Less Unemployment through Stabilization of
Operations, in which the subject of seasonal variations in
employment constituted an important part.

During the years leading up to the depression of the 1930's,
much was written on seasonal variation in employment and its
contemplated stabilization. Thereafter, the problem of cyclical
unemployment and its solution by means of unemployment
insurance and the entire social security program dominated the
scene.³

1 New York, 1923, pp. 6, 116-120, 161, 215.

2 70th Congress, 2d Session, "Unemployment in the United States,"
S.R. 219.

3 SMITH, EDWIN S., Reducing Seasonal Unemployment, The Experience of
American Manufacturing Concerns (1931). DOUGLAS, PAUL H., and AARON
DIRECTOR, The Problem of Unemployment. This book devotes pp. 73-118
to the subject of seasonal variations and regularization of industry to
stabilize such fluctuations. HANSEN, ALVIN H., and TILLMAN M. SOGGE,
Seasonal Irregularity of Employment in Minneapolis, St. Paul and Duluth
(Employment Stabilization Research Institute, November, 1931).
BERRIDGE, W. A., "Employment and Income of Labor in the United States,"
in International Unemployment (a study of fluctuations in employment and
unemployment in several countries, 1910-1930, Industrial Relations
Institute, The Hague, Netherlands, 1932).

A familiar example of seasonal activity in the economic sphere
is construction activity, which gives not more than two-thirds
as much employment in the winter months, on the average, as
in the summer. Some important manufacturing industries, too,
such as the automobile, agricultural implements, and ready-made
clothing industries, show a considerable seasonal fluctuation.
To be sure, the busy season in some industries comes in the dull
season for others, a fact that tends to level out the differences
between the number employed in industry in its entirety in
one month as compared with another. But this does not mean
that the workers released by one industry are absorbed by
another to a sufficient degree or with sufficient promptitude to
obliterate the variations from month to month in the amount
of their employment. Barriers of specialized skill, geography,
and attachment to particular occupations and localities prevent
anything like the dovetailing suggested by the figures of the
total number employed.¹ Consequently, the statistics of total
employment may show little seasonal variation, while at the
same time large degrees of seasonal unemployment exist in many
parts of the total. The fact that there is no seasonal variation
or little seasonal variation in total employment does not solve the
unemployment problem for the seasonally unemployed worker.

One reason why statistical concern about seasonal variations
in employment has been stimulated is that the opinion prevails
that this particular type of unemployment is in large part
avoidable. The movement to
inaugurate unemployment insurance in the United States was
partly based upon the belief that such a measure for the relief
of unemployment would tend to regularize industries affected
by seasonal unemployment. It is recognized that the greater
problem of cyclical unemployment is less easily solved. The
literature on the subject of unemployment insurance in the
United States makes it clear that the movement is directed
particularly toward the regularization of industry to eliminate as
much as possible of the seasonal fluctuation in employment.²

With these problems in mind, students of the labor problem
asked: What types of business are responsible for the largest

1 McCABE, DAVID A., chapter on Unemployment, in Facing the Facts (a
symposium, 1932), pp. 324-325, 338-351.
2 Cf. McCABE, op. cit., pp. 344-346, 350.

part of this seasonal irregularity in employment? What are the
peak and slack seasons of employment in different businesses, and
what are the amplitudes of fluctuations? What can be done to
make business less seasonal and less irregular? What is the cost
of regularization plans in an industry, and how do such costs
compare with the savings resulting from more regular use of capital
investment? These are the types of exceedingly practical
problems presenting themselves in this field of economics, and they
have stimulated statistical research to take measurements of
seasonal variations. They are of practical significance to
employers and to investors and to workers. They are of great
social and psychological significance to the social scientist, the
economist, and the political theorist.


METHODS OF MEASURING SEASONAL VARIATIONS


It has been seen that the method of discovering trends either
for their own sake (rational trends) or in order to remove them
from the data, i.e., to get rid of them (empirical trends), has been
based upon curve-fitting technique. The technical problem
involved is a simple one even though the mathematics may be
complex in some cases. The simplicity of the idea is somewhat
offset, however, by the irrational character of the procedure.
This is a troublesome factor because it is the function of the
statistician not only to apply mathematical analysis to statistics
but also to explain what he does and why he does it. Enough
has been included in Chaps. XX and XXI to indicate the
general character of this problem.

In the case of seasonal variation, the difficulties of the
statistician are just the reverse; for while it has been possible to build
up a perfectly rational procedure, upon the basis of the theory of
averages, the technical problem involved has been found to be
a complex one. The rational concept underlying the procedure of
measuring seasonal variation is that, where a time series has a
characteristic seasonal variation occurring year after year, it
should be quite reasonable to depict a "typical," or average,
seasonal variation for that time series.

In its abstract aspect, therefore, the concept is perfectly
rational. Homogeneous variates are to be averaged to obtain
a type. For example, it is proposed to average the amount
by which January data are higher or lower than those of other
months of the year, to average the amount by which February
data are higher or lower than those of other months of the
year, and so on, until a picture of the "type" of periodical
movement that occurs every year is obtained, although each year may
be slightly different from the type. Moderate variations from
the type are quite consonant with the theory of averages and
their application to the problem of measuring seasonal variation.¹

When the rational procedure is to be put into effect, however,
difficulties of a technical character arise. A time series of raw
data that by a priori knowledge should have a distinctively
regular seasonal variation may be selected. A graph of the time
series is made, and a seasonal variation occurring every year is
revealed, but the seasonal periodicity in the raw data is distorted
by other movements, namely, trend and cycle. This was noted
at the beginning of Chap. XX where a hypothetical time series
was constructed. It is clear that the data in their raw state
cannot be averaged to find the typical seasonal periodicity.
That is to say, January, 1937, is not homogeneous with respect
to seasonal variation with January, 1940, because the relative
position of the respective Januaries (1) as to trend and (2) as to
cycles is not comparable. In other words, averaging the raw
data of all the Januaries in a series of data, all the Februaries,
etc., for the 12 months of the year would be an irrational
procedure. This would not accord with the rational idea of seasonal
variation outlined above because the averages of raw data would
include averages of something in addition to seasonal variation.

Problem of Isolating Seasonal Variation. To average the
actual seasonal variation, it must be isolated from the other
types of variation in the raw data. The technical problem
involved in the measurement of seasonal variation is thus how to
isolate from the raw data that part of its fluctuation that is
essentially seasonal in character. When these other types of

1 If the seasonal variation were measured weekly, rather than monthly,
the principle would be the same. The same principle may be used to
measure periodicity by days within the month or within the week; and it
may likewise be used to measure periodicity by hours within the day.
Thus periodicity by days within the month of wage payments might have
great economic value for some problems; and periodicity by hours within
the day of consumption of electrical power might have significance in
connection with some problems. Seasonal variation is only one type of
periodicity that can be measured by this method.

fluctuation have been removed from the raw data, by subtraction
or division, all the residual Januaries can be rationally averaged,
all the residual Februaries can be averaged, etc., in order to
obtain a picture of the average position, respectively, of each
month.

How can this be done? There are several answers to this
question, and there is controversy as to just what is the best
technical procedure. In his notable studies of seasonal variation
made about 1910, Prof. Kemmerer devised a method for
measuring seasonal variation separate from other fluctuations. At the
time, it was the best that had been suggested.¹

Another famous suggestion as to a method of isolating the
seasonal periodicities from other types of fluctuations was made
by W. M. Persons, when from 1915 to 1919 he developed his
approach to the problems of time-series analysis, culminating
in the establishment of the Harvard Economic Society's business
barometer and the Review of Economic Statistics. Persons'
method, called the "link relative method," expresses each
monthly figure as a relative of the immediately preceding month;
the seasonal pattern is found by averaging all the link relatives
for the same month and taking any residual trend out of the chain
relatives computed from these average link relatives.²

A third method of isolating the seasonal fluctuations and
measuring them by an index of seasonal variation is that
advocating simply the removal of trend from the data and then the
averaging of the monthly ratio differences from the trend.³
While this method removes the nonhomogeneous effects of trend,
it does not remove those due to cyclical fluctuations. If taken
over a sufficient period of time, the bias of the cyclical fluctuations
will cancel so that a true index of seasonal variations would be


1 Op. cit. Cf. criticism of Kemmerer's method by W. L. Hart, Journal of
the American Statistical Association, Vol. 17 (1922). Kemmerer's work
constitutes an important pioneer effort to solve the technical difficulties
involved and helped direct attention to better solutions.

2 Review of Economic Statistics, January, 1919, pp. 18-31; Indices of
Business Conditions (1919). Cf. RIETZ, H. L., Handbook of Mathematical
Statistics, pp. 151-156.

3 FALKNER, HELEN D., "The Measurement of Seasonal Variation,"
Journal of the American Statistical Association, Vol. 19 (1924), pp. 167-179;
ROBB, RICHARD A., "Variate Difference Method of Seasonal Variation,"
ibid., Vol. 24 (1929), pp. 250-257.

obtained by averaging. One of the discoveries of recent years,
however, is that seasonal variations change in the same time
series from one era to another owing to new conditions; for such
time series an index of seasonal variation based on a period of
time covering two or more eras would be comparatively useless.
Thus the criticism of the ratio-difference-from-the-trend method
is that, if taken over a sufficient period of time to make it a
valid measurement of seasonal variation, it would be taken for
too long a time, i.e., that two or more eras of typical seasonal
fluctuation might be confused.

A number of other methods have been suggested, based
upon the principles that have been outlined.¹ The most widely
used and probably the best method is the 12 months' moving
average method, of which a number of refinements have been
suggested. Since this method is the one most extensively used,
it is now described in detail and an illustration will be given.

Twelve Months' Moving Average Method. This method
consists of the following steps:

1. Calculate a 12 months' moving average of the raw data,
centering the moving average at the seventh month; thus,
opposite July of the first year would be the average of the
12 months of that year; opposite August would be the average
of the last 11 months of that year and the first month of the next
year; and so on.

2. Divide the raw data serially by the 12 months' moving
average. Inasmuch as the moving average would contain
in it the elements both of trend and of major and minor cycles,
the residuals of the raw data from the moving average (either
by subtraction or division) would contain purely seasonal
fluctuations.


1 KING, W. I., "An Improved Method for Measuring the Seasonal
Factor," Journal of the American Statistical Association, Vol. 19 (1924), pp.
301-313; CARMICHAEL, F. L., "Methods of Computing Seasonal Indexes:
Constant and Progressive," ibid., Vol. 22 (1927), pp. 339-354; JOY, ARYNESS,
and THOMAS WOODLIEF, "Use of Moving Averages in the Measurement of
Seasonal Variation," ibid., Vol. 23 (1928), pp. 241-252; BAUMANN, A. O.,
"Thirteen Months-Ratio-First Difference Method of Measuring Seasonal
Variation," ibid., Vol. 23 (1928), pp. 282-290; KUZNETS, SIMON, "Seasonal
Patterns and Seasonal Amplitudes: Measurement of Their Short-time
Variations," ibid., Vol. 27 (1932), pp. 9-20; RIGGLEMAN, JOHN R., and IRA
N. FRISBEE, Business Statistics (1932), pp. 226-242.

3. Make frequency distributions of the several months (see
the example at the end of the chapter).

4. Find the median relatives for each month, using the median
to avoid the influence of extreme fluctuations.

5. Express these median relatives as a percentage of their own
average, thus giving an index of seasonal variation.

As a short cut, inasmuch as the result will be precisely the
same, the 12 months' moving total may be used instead of the
12 months' moving average, thus saving the division throughout
by 12.
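
The five steps, with the moving-total short cut, can be sketched as follows. This is a minimal illustration rather than the authors' work sheet: the function name `seasonal_index`, its `first_month` parameter, and the artificial series are assumptions made here. Step 3 (the frequency distributions) is represented simply by collecting each calendar month's ratios in a list before the median is taken.

```python
from statistics import median


def seasonal_index(monthly, first_month=1):
    """Index of seasonal variation (January..December, averaging 100).

    monthly     -- consecutive monthly figures
    first_month -- calendar month (1-12) of the first figure
    """
    if len(monthly) < 23:
        raise ValueError("need at least 23 consecutive months")
    ratios = {month: [] for month in range(1, 13)}
    total = sum(monthly[:12])                     # step 1, using the moving total
    for start in range(len(monthly) - 11):
        if start:
            # Drop the month leaving the 12-month span, add the month entering it.
            total += monthly[start + 11] - monthly[start - 1]
        centre = start + 6                        # seventh month of the span
        calendar_month = (first_month - 1 + centre) % 12 + 1
        # Step 2: raw datum as a percentage of the moving total placed opposite it.
        ratios[calendar_month].append(100 * monthly[centre] / total)
    # Steps 3 and 4: the ratios for each calendar month, summarized by their median.
    medians = [median(ratios[month]) for month in range(1, 13)]
    # Step 5: medians expressed as a percentage of their own average.
    average = sum(medians) / 12
    return [100 * m / average for m in medians]


if __name__ == "__main__":
    # Artificial series: a fixed seasonal pattern repeated for four years.
    pattern = [80, 70, 90, 100, 110, 120, 130, 120, 110, 100, 90, 80]
    print([round(value, 1) for value in seasonal_index(pattern * 4)])
```

Because the artificial series has no trend or cycle, the index returned reproduces the pattern itself. With real data the medians would differ from any single year's ratios.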

Problem Illustrating Measurement of Seasonal Variation. 
Calculating the Index of Seasonal Variation by the 12 Months’ 
Moving Average Method. The time series of monthly data on 
consumer installment-sale debt for household appliances in the 
United States has been selected to illustrate the calculation of an 
index of seasonal variation by the 12 months’ moving average 
method. Table 96 is a work sheet for the calculations necessary 
to the problem. The data were recorded on this work sheet for
the years 1929-1942 by months, the raw data appearing in 
column (1). Next, a 12 months’ moving total was calculated; 
this appears in column (2), the moving total being “centered at 
the seventh month.” For example, the figure 2,930 after July, 
1929, in column (2) of the work sheet is the total of the 12 
monthly figures for 1929; the figure 2,972 (opposite August, 
1929) is the total of the next 12 monthly figures, beginning with 
February, 1929, and ending with January, 1930. Opposite each 
July is the total for that year; this constitutes a good cross 
check in the construction of the moving total. 
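
The first entries of column (2) can be reproduced with a short Python sketch. The function name `moving_totals` is an assumption made for the example; the figures are the monthly raw data for January, 1929, through February, 1930, and the running update (add the newest month, drop the oldest) is one convenient way to carry the total forward.

```python
def moving_totals(raw, window=12):
    """Moving totals by a running add-and-subtract update."""
    total = sum(raw[:window])                # first total: months 1-12
    totals = [total]
    for i in range(window, len(raw)):
        total += raw[i] - raw[i - window]    # add newest month, drop oldest
        totals.append(total)
    return totals

# Column (1) of the work sheet: January, 1929, through February, 1930.
raw = [207, 199, 199, 217, 237, 260, 273, 274, 272, 266, 261, 265,
       249, 233]
totals = moving_totals(raw)
# totals[k] is entered opposite the 7th month of its window, that is,
# opposite raw[k + 6]: July, August, and September, 1929.
print(totals)   # [2930, 2972, 3006]
```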

To calculate the moving total, first put the 12 monthly figures 
for 1929 in the adding machine, and take a subtotal; then sub- 
tract the datum for January, 1929, and add the datum for 
January, 1930, and take a subtotal; then subtract the datum for 
February, 1929, and add the datum for February, 1930, and 
take a subtotal; and so on, until the end of the time series. 
Clear the machine, and then add independently the last 12 
months of the time series; this should check with your last 
subtotal. If it does not check, a mistake has been made, which 
can be most readily found by checking up on the July subtotals 
for each year, beginning with the last one and going back until 
you find the mistake. These subtotals are the 12 months’
moving total and can be tabulated in column (2) of the work
sheet, as in Table 96. The next step is to divide each monthly
raw datum by the corresponding moving total figure, expressing
the answer as a percentage figure in column (3). The figures in
column (3) are then tabulated in a system of frequency arrays
as in Fig. 147.

TABLE 96.—WORK SHEET FOR CALCULATING INDEX OF SEASONAL VARIATION

DATA: Consumer installment-sale debt, monthly, for household appliances,
end of month

(In millions of dollars)

                  (1)       (2)           (3)
Year and month    Monthly   12 months’    Raw data divided
                  raw data  moving total  by 12 months’
                            centered at   moving total,
                            7th month     per cent

1929:
  January         207
  February        199
  March           199
  April           217
  May             237
  June            260
  July            273     2,930      9.32
  August          274     2,972      9.22
  September       272     3,006      9.05
  October         266     3,031      8.78
  November        261     3,043      8.58
  December        265     3,043      8.71
1930:
  January         249     3,031      8.22
  February        233     3,010      7.74
  March           224     2,984      7.51
  April           229     2,953      7.75
  May             237     2,919      8.12
  June            248     2,881      8.61
  July            252     2,838      8.88
  August          248     2,800      8.86
  September       241     2,766      8.71
  October         232     2,734      8.48
  November        223     2,701      8.26
  December        222     2,666      8.33
1931:
  January         211     2,628      8.03
  February        199     2,588      7.69
  March           192     2,548      7.54
  April           196     2,509      7.81
  May             202     2,471      8.17
  June            210     2,434      8.63
  July            212     2,397      8.84
  August          208     2,357      8.82
  September       202     2,318      8.71
  October         194     2,275      8.53
  November        186     2,225      8.36
  December        185     2,167      8.54
1932:
  January         171     2,101      8.14
  February        160     2,028      7.89
  March           149     1,954      7.63
  April           146     1,881      7.76
  May             144     1,813      7.94
  June            144     1,749      8.23
  July            139     1,685      8.25
  August          134     1,628      8.23
  September       129     1,575      8.19
  October         126     1,528      8.25
  November        122     1,484      8.22
  December        121     1,447      8.36
1933:
  January         114     1,418      8.04
  February        107     1,398      7.65
  March           102     1,386      7.36
  April           102     1,378      7.40
  May             107     1,372      7.80
  June            115     1,367      8.41
  July            119     1,365      8.72
  August          122     1,364      8.94
  September       121     1,365      8.86
  October         120     1,370      8.76
  November        117     1,384      8.45
  December        119     1,403      8.48
1934:
  January         113     1,422      7.95
  February        108     1,441      7.49
  March           107     1,456      7.35
  April           116     1,468      7.90
  May             126     1,479      8.52
  June            134     1,490      8.99
  July            138     1,502      9.19
  August          137     1,515      9.04
  September       133     1,528      8.70
  October         131     1,544      8.48
  November        128     1,561      8.20
  December        131     1,578      8.30
1935:
  January         126     1,600      7.88
  February        121     1,626      7.44
  March           123     1,657      7.42
  April           133     1,692      7.86
  May             143     1,728      8.28
  June            156     1,768      8.82
  July            164     1,808      9.07
  August          168     1,845      9.10
  September       168     1,882      8.93
  October         167     1,921      8.69
  November        168     1,964      8.55
  December        171     2,017      8.48
1936:
  January         163     2,075      7.86
  February        158     2,143      7.37
  March           162     2,211      7.33
  April           176     2,281      7.72
  May             196     2,353      8.33
  June            214     2,428      8.81
  July            232     2,512      9.24
  August          236     2,596      9.09
  September       238     2,683      8.87
  October         239     2,772      8.62
  November        243     2,863      8.49
  December        255     2,952      8.64
1937:
  January         247     3,045      8.11
  February        245     3,130      7.83
  March           251     3,217      7.80
  April           267     3,302      8.09
  May             285     3,382      8.43
  June            307     3,451      8.90
  July            317     3,503      9.05
  August          323     3,552      9.09
  September       323     3,594      8.99
  October         319     3,624      8.80
  November        312     3,639      8.57
  December        307     3,636      8.44
1938:
  January         296     3,610      8.20
  February        287     3,571      8.04
  March           281     3,525      7.97
  April           282     3,475      8.12
  May             282     3,420      8.24
  June            281     3,371      8.34
  July            278     3,330      8.35
  August          277     3,290      8.42
  September       273     3,253      8.39
  October         264     3,218      8.20
  November        263     3,183      8.26
  December        266     3,154      8.43
1939:
  January         256     3,153      8.12
  February        250     3,120      8.01
  March           246     3,110      7.91
  April           247     3,103      7.96
  May             253     3,104      8.15
  June            260     3,106      8.37
  July            265     3,113      8.51
  August          267     3,119      8.56
  September       266     3,124      8.51
  October         265     3,131      8.46
  November        265     3,143      8.43
  December        273     3,161      8.64
1940:
  January         262     3,182      8.23
  February        255     3,205      7.96
  March           253     3,232      7.
  April           259     3,259      7.
  May             271     3,284      8.
  June            281     3,309      8.
  July            288     3,338      8.
  August          294     3,366      8.
  September       293     3,397      8.
  October         290     3,430      8.
  November        290     3,474      8.
  December        302     3,523      8.
1941:
  January         290     3,572      8.12
  February        286     3,620      7.90
  March           286     3,672      7.79
  April           303     3,721      8.14
  May             320     3,764      8.50
  June            330     3,794      8.70
  July            336     3,805      8.83
  August          346     3,809      9.08
  September       342     3,808      8.98
  October         333     3,794      8.78
  November        320     3,749      8.54
  December        313     3,670      8.53
1942:
  January         294     3,559      8.26
  February        285     3,425      8.32
  March           272     3,262      8.34
  April           258     3,089      8.35
  May             241
  June            219
  July            202
  August          183
  September       169
  October
  November
  December

Source: HOLTHAUSEN, DUNCAN McG., “Monthly Estimates of Short-term Consumer
Debt, 1929-1942,” Survey of Current Business, Vol. 22 (November, 1942), pp. 9-25, 17.

From Fig. 147 the median monthly relatives are read and
arranged as in Table 97.

FIG. 147.—Frequency arrays, one for each month, of distributions of monthly
ratios of raw data to 12 months’ moving total. Column (3) of Table 96.
Consumer installment-sale debt for household appliances in the United States,
1935-1942.

Column (1) of Table 97 consists of the median relatives read
from Fig. 147; and these median relatives have only to be
expressed as percentages of their own average to give the index
of seasonal variation. This is done, giving the figures in column
(2) of Table 97. No adjustment is necessary because trend and
cyclical influences were removed by the use of the 12 months’
moving total as a divisor into the raw data, thereby isolating
residuals each month that presumably contained seasonal variations
and chance fluctuations. By averaging, the chance
fluctuations are canceled out, leaving in the index a description
of relative seasonal movement. The theory of this method is
based, of course, upon the use of a 12 months’ moving average;
but precisely the same arithmetical results are obtained by using
the moving total instead, and it is a saving of a considerable number
of division processes. In dividing by the 12 months’ moving
total instead of by the 12 months’ moving average, the average
residual percentage is 8.333+, whereas, in dividing by the
12 months’ moving average, the average residual percentage
would be 12 times 8.333+, or 100.00.

TABLE 97.—INDEX OF SEASONAL VARIATION IN CONSUMER INSTALLMENT-
SALE DEBT FOR HOUSEHOLD APPLIANCES IN THE UNITED STATES

                            Index of seasonal
Month           Medians     variation¹

January           8.13          96.2
February          7.95          94.0
March             7.82          92.5
April             8.05          95.2
May               8.25          97.6
June              8.72         103.2
July              8.85         104.7
August            9.10         107.6
September         8.88         105.0
October           8.64         102.2
November          8.50         100.6
December          8.55         101.1

Total           101.44       1,200.0
Average           8.4533+

¹ This column consists of the medians expressed as percentages of their average.
Thus 8.13 is 96.2 per cent of 8.4533+.

FIG. 148.—Studies in seasonal variations in commercial paper rates, before and
since the establishment of the Federal Reserve System.
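
The arithmetic of Table 97 can be checked with a few lines of Python. This is a sketch for illustration only; the variable names are assumptions, and the medians are those listed in the table.

```python
# Medians from Table 97 (ratios to the 12 months' moving total, per cent).
medians = [8.13, 7.95, 7.82, 8.05, 8.25, 8.72,
           8.85, 9.10, 8.88, 8.64, 8.50, 8.55]

average = sum(medians) / len(medians)            # 8.4533+
index = [100.0 * m / average for m in medians]   # index averages 100

# Had the moving average been the divisor, every ratio would be 12 times
# as large, and so would their average; the index is unchanged.
medians_ma = [12.0 * m for m in medians]
average_ma = sum(medians_ma) / len(medians_ma)
index_ma = [100.0 * m / average_ma for m in medians_ma]

for month, value in zip(range(1, 13), index):
    print(month, round(value, 1))
```

Because the factor of 12 appears in both the ratios and their average, it cancels when the medians are expressed as percentages of their own average.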

From the multiple frequency array, as in Figs. 147 and 148, it 
can be determined whether or not the seasonal variation is well 
defined. If the course of all the recorded ratios of raw data to 
the 12 months’ moving total by months tends to run close to the 
course of the medians, then the seasonal variation is a well- 
defined one. If, however, the points are scattered in a wide 
range from the medians and the general swing of the data does 
not correspond to the movements of the median line, then the 
seasonal variation is not well defined. Such a result might be 
obtained if the type of the seasonal variation were changing, and 
in that case the data may be studied in groups of a smaller num- 
ber of years. Figure 148 is included to present examples of 
poorly defined seasonal variation, as compared with well-defined 
cases of seasonal variation. The data studied are commercial 
paper rates in the New York money market before and after 
the inception of the Federal Reserve System. From the figure 
it is seen how well defined the seasonal variation in commercial 
paper rates was before the beginning of the Federal Reserve 
System—namely, for the periods 1904-1909 and 1909-1914. 
Also, it is seen how poorly defined is the seasonal variation for 
the periods 1920-1925 and 1925-1930—so poorly that there 
could hardly be said to have been any consistent seasonal 
periodicity whatever.¹


METHOD OF DETECTING CHANGING SEASONAL VARIATION 


Figure 149 is drawn to discover whether or not, during the
years from 1929 to 1941, the seasonal variation in consumer
installment debt for household appliances has changed.² The
figure is a plot of the ratios to 12 months’ moving total shown
in the last column of Table 96; a separate graph for each month
has been drawn in Fig. 149.

¹ For a more complete discussion see The New York Money Market, Vol.
4, pp. 510-530.

² For other suggested methods of measuring changing seasonal variation
see Julius Shiskin, “A New Multiplicative Seasonal Index,” Journal of the
American Statistical Association, Vol. 37 (1942), pp. 507-516; Henry A.
Latané, “Seasonal Factors Determined by Difference from Average of
Adjacent Months,” Journal of the American Statistical Association, Vol.
37 (1942), pp. 517-522; Dudley J. Cowden, “Moving Seasonal Indexes,”
Journal of the American Statistical Association, Vol. 37 (1942), pp. 523-524.

FIG. 149.—Trends in seasonal variation in consumer instalment-

The straight-line trends for each month were drawn in “at
sight”; they were not fitted by a mathematical method. Figure
149 shows that the relative seasonal position of January, June, 
October, November, and December remained about the same 
during this period of years. But the relative seasonal amount of 
consumer installment debt was rising in the months of February, 
March, April, and May, while the relative seasonal quantity of 
consumer installment debt was declining in the months of 
July, August, and September. 
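
The trend lines in Fig. 149 were drawn at sight. As an alternative check, a least-squares straight line can be fitted to the ratios of a single month; the sketch below does this for the February ratios of column (3), Table 96, for 1930-1941. The least-squares fit is a technique substituted here for illustration, not the authors' procedure, and the function name `fit_line` is an assumption.

```python
def fit_line(xs, ys):
    """Ordinary least-squares straight line; returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# February ratios to the 12 months' moving total, 1930-1941.
years = list(range(1930, 1942))
feb = [7.74, 7.69, 7.89, 7.65, 7.49, 7.44, 7.37, 7.83, 8.04, 8.01, 7.96, 7.90]

intercept, slope = fit_line(years, feb)
print(round(slope, 4))                      # a positive slope: February rising
print(round(intercept + slope * 1942, 2))   # fitted line carried to 1942
```

A fitted line of this kind need not coincide with one drawn at sight, so its 1942 value may differ somewhat from a ratio read off the chart.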

Consequently, a more refined index of seasonal variation 
than the average of a period of years such as that shown in 
Table 97 and Fig. 148 can be obtained from Fig. 149. In fact, 
since these trends exist, a different index of seasonal variation
for each year is required. For 1942 this index of seasonal varia- 
tion can be obtained as indicated in Table 98. 


TABLE 98.—COMPUTATION OF INDEX OF SEASONAL VARIATION IN CONSUMER
INSTALLMENT-SALE DEBT FOR HOUSEHOLD APPLIANCES, 1942

                Ratios, read        Index of
Month           from trend lines    seasonal
                in Fig. 149         variation¹

January              8.10              96.4
February             8.09              96.2
March                8.08              96.1
April                8.15              97.0
May                  8.35              99.3
June                 8.60             102.3
July                 8.65             102.9
August               8.75             104.1
September            8.65             102.9
October              8.54             101.6
November             8.40              99.9
December             8.50             101.1

Total              100.86           1,200.0
Average              8.405

¹ Obtained by expressing ratios in the first column as percentages of their own average.


CHAPTER XXIV 
DETERMINATION OF CYCLE 


Usually it is desirable to have current figures on a monthly 
basis, and to know how actual experience compares with what 
should be expected for the season and with normal growth. 
Can we estimate our position in the business cycle from month 
to month? Annual data adjusted for trend and a picture of 
undistorted seasonal variation, illustrated in the preceding 
chapters, do not go quite far enough. It is often necessary 
to remove trend and seasonal variation from monthly data in 
order to determine position in the cycle. 

Cycle Determined by Adjusting Monthly Data. When monthly, 
instead of annual, data are analyzed, the empirical trend may be 
found by setting up a work sheet similar to Table 83 or 86 (pages 
582, 588), depending upon the type of trend selected. The 
trend is then fitted by the method of least squares in a manner 
precisely similar to that demonstrated for annual data. Of 
course, if quite a number of years of monthly data are thus 
treated, the calculations become very extended, but the principle 
remains the same.¹ It is possible, however, to derive an approximation
of the monthly trend equation from an annual trend
equation. This is explained in the present chapter and may 
serve as an economizer of time in the analysis of monthly data. 

Determination of Cycle in Annual Data. While the purpose
of this chapter is to present a method for measuring the cycle
in monthly data, it may be noted at the start that even if the
object of analysis is to determine cycle in monthly data it is
desirable first to study the annual data. Not only is this true
because the monthly trend may be easily estimated from the
annual trend, but it is also desirable because the general character
of the results on a monthly basis can be visualized from the
analysis of the annual figures and the annual data can be analyzed
with a much smaller amount of computation. The analysis
on an annual basis will help judge the kind of analysis required
for the monthly data, whether to use a straight-line trend or a
second- or third-degree polynomial trend. In addition, the
analysis on an annual basis will help to decide the significance
of the respective trends. This will now be illustrated by making
use of monthly and annual averages of monthly data on consumer
installment-sale debt for household appliances in the United
States, 1929-1942.*

¹ Where trends are calculated for monthly data, involving long series,
convenient short-cut methods of calculation have been devised. Cf. Ross,
F. A., “Formulae for Facilitating Computations in Time Series Analysis,”
Journal of the American Statistical Association, Vol. 20 (1925), pp. 75-79; cf.
also timesaving devices discussed in Chap. XXII.

The work sheet is not reproduced here, but one similar to 
Table 91 (page 594) was constructed and the following set of 
subtotals was obtained: S₁ = 2,842; S₂ = 18,159; S₃ = 89,614;
and S₄ = 362,901 (in millions of dollars). Using the method
of orthogonal polynomials, which is the quickest method of 
finding at once the first-, second-, and third-degree polynomial 
trend lines by one set of calculations, the following results 
were obtained by the application of Eqs. (5) and (8) to (10), 
Chap. XXII (pages 605-612): 


e d = 218.61538 

pm Ba — 199.54945 
oes a = 196.95384 
pb SCH = 199.39615 


The values of the denominators in the above fractions were 
obtained from Table 98 (page 613). These calculations are 
carried to more places than are significant for the problem because 
they must be combined in multiple proportions to obtain the 
following results by using Eqs. (9) and (10), Chap. XXII 
(pages 611-612): 


$$
\begin{aligned}
a' &= 218.61538 & A &= 218.61538\\
b' &= 19.06593 & B &= 9.53296\\
c' &= 13.87471 & C &= 3.15334\\
d' &= -6.12307 & D &= -0.64948
\end{aligned}
$$

*HOLTHAUSEN, DUNCAN McG., “Monthly Estimates of Short-term
Consumer Debt, 1929-1942,” Survey of Current Business, Vol. 22 (November,
1942), pp. 9-25, 17.


Accordingly, the following set of orthogonal-polynomial 
trends is obtained; the first line is the straight-line trend; the 
first line combined with the second line is the second-degree 
polynomial trend; the first, second, and third lines combined 
give the third-degree polynomial trend: 


$$
\begin{aligned}
y' = 218.61538 &+ 9.53296\,p_1\\
&+ 3.15334\,p_2\\
&- 0.64948\,p_3
\end{aligned}
$$

that is to say, since $p_1 = t$, $p_2 = t^2 - 14$, and $p_3 = t^3 - 25t$
when N = 13 (see pages 605 and 615),

$$
\begin{aligned}
y' = 218.61538 &+ 9.53296\,t\\
&+ 3.15334\,(t^2 - 14)\\
&- 0.64948\,(t^3 - 25t)
\end{aligned}
$$


The three possible trends are therefore the following (in
millions of dollars):

Straight-line trend:

$$y' = 218.6 + 9.533t \quad \text{(origin at 1935)}$$

Second-degree polynomial trend:

$$y' = 174.5 + 9.533t + 3.153t^2 \quad \text{(origin at 1935)}$$

Third-degree polynomial trend:

$$y' = 174.5 + 25.77t + 3.153t^2 - 0.6495t^3 \quad \text{(origin at 1935)}$$


Table 99 is presented to show the raw annual data and the 
annual values of each of these three trends, which are also
presented graphically in Fig. 150. 
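
The coefficients of the three trends can be reproduced from the annual raw data of Table 99 with a short Python sketch. It assumes only that the polynomials t, t² − 14, and t³ − 25t are mutually orthogonal over t = −6, ..., 6, so that each coefficient is a simple ratio of sums; the helper names `coefficient` and `trends` are illustrative and do not correspond to the subtotal work sheet used in the text.

```python
# Annual averages of monthly data, 1929-1941 (raw-data column of Table 99).
raw = [244, 236, 200, 140, 114, 125, 151, 209, 292, 277, 259, 278, 317]
t_values = list(range(-6, 7))                # origin at 1935

# Orthogonal polynomials for N = 13.
p1 = [t for t in t_values]
p2 = [t**2 - 14 for t in t_values]
p3 = [t**3 - 25 * t for t in t_values]

def coefficient(y, p):
    """Least-squares coefficient of one orthogonal polynomial term."""
    return sum(yi * pi for yi, pi in zip(y, p)) / sum(pi * pi for pi in p)

A = sum(raw) / len(raw)
B = coefficient(raw, p1)
C = coefficient(raw, p2)
D = coefficient(raw, p3)
print(round(A, 5), round(B, 5), round(C, 5), round(D, 5))

def trends(t):
    """Straight-line, second-degree, and third-degree trend values at t."""
    line = A + B * t
    second = line + C * (t**2 - 14)
    third = second + D * (t**3 - 25 * t)
    return line, second, third

for year, t in zip(range(1929, 1942), t_values):
    print(year, [round(v) for v in trends(t)])
```

Because the terms are orthogonal, adding the second- or third-degree term leaves the earlier coefficients unchanged, which is why all three trends come out of one set of calculations.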

An annual increment averaging something less than 10 on a
base of over 200 is not too great to deter the assumption that the
straight-line trend roughly depicts rational long-term growth.
If it is appropriate to suppose that installment-sale debt would
tend to grow at the rate of population growth, which is presumably
geometric, the trend line fitted should be a logarithmic
trend; but for a short period of years the straight-line trend is an
adequate approximation of the more appropriate logarithmic
trend. The straight-line trend in the data presently studied may
consequently be used as the base from which the major cycle in
the data can be measured.

TABLE 99.—ANNUAL TREND ANALYSIS OF CONSUMER INSTALLMENT-SALE
DEBT FOR HOUSEHOLD APPLIANCES IN THE UNITED STATES, 1929-1942
(In millions of dollars)

                     Straight-line   Second-degree   Third-degree
Year     Raw data    trend           polynomial      polynomial
                                     trend           trend

1929       244         161             231             274
1930       236         171             206             206
1931       200         180             187             163
1932       140         190             174             143
1933       114         200             168             141
1934       125         209             168             152
1935       151         219             174             174
1936       209         228             187             203
1937       292         238             206             233
1938       277         247             232             263
1939       259         257             263             286
1940       278         266             301             301
1941       317         276             345             302
1942         *         286†            396†            287†

* Data available for only the first 9 months of the year; this year was not used in fitting
the trends.
† Extrapolated.

FIG. 150.—Annual trend analysis of consumer installment-sale debt for household
appliances in the United States, 1929-1942.

The second-degree polynomial trend shows a sharp rise for the 
later years, and if extrapolated beyond 1942 it would quickly 
approach infinity; it is not, therefore, a reasonable picture of 
rational growth. The straight line comes nearer to what would 
be the result if a logarithmic trend were fitted, if data covering 
a long enough period were available to afford sufficient per- 
spective to obtain a growth curve. 


TABLE 100.—CYCLICAL MOVEMENTS IN CONSUMER INSTALLMENT-SALE
DEBT FOR HOUSEHOLD APPLIANCES IN THE UNITED STATES, 1929-1942

        Raw data,     Straight-line     Third-degree       Cycle,           Cycle mixed
Year    millions of   trend, millions   polynomial trend,  per cent         with residuals,
        dollars       of dollars        millions of        (third-degree    per cent
        (y)           (y')              dollars            trend ÷ y'       (y ÷ y' × 100)
                                                           × 100)

1929      244           161               274                170              152
1930      236           171               206                120              138
1931      200           180               163                 90              111
1932      140           190               143                 75               74
1933      114           200               141                 70               57
1934      125           209               152                 73               60
1935      151           219               174                 79               69
1936      209           228               203                 89               92
1937      292           238               233                 98              123
1938      277           247               263                106              112
1939      259           257               286                111              101
1940      278           266               301                113              104
1941      317           276               302                109              114

The third-degree polynomial trend seems adequately to
represent the rounded contour of a major cycle. Examination
of Fig. 150, accordingly, leads to the conclusion that the straight-line
trend can be used to depict growth in the data, and the
third-degree polynomial trend can be used to measure the major
cycle. As a consequence, the raw data divided by the straight-line
trend should give a picture of the major cyclical movement
in this period (plus the residual fluctuations);¹ and the third-degree
polynomial trend divided by the straight-line trend gives
a measure of the major cycle. Table 100 and Fig. 151 give the
results of such computations. The column headed Cycle in
Table 100 consists of the third-degree polynomial empirical
trend divided by the straight-line empirical trend, giving as a
result a smoothed measure of the major cycle. This is shown
by the heavy line in Fig. 151. The column headed Cycle mixed
with residuals in Table 100 consists of the raw data divided by
the straight-line empirical trend, giving as a result the measure
of the cycle mixed with residual fluctuations in annual data.²
Both these columns are expressed as percentages, with the y'
for each year equal to 100.

FIG. 151.—Cyclical study of consumer installment-sale debt for household
appliances in the United States, 1929-1942.
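
The two percentage columns of Table 100 follow directly from the trend values. The Python sketch below recomputes them from the rounded figures printed in the table; the variable names are assumptions made for the example, and small differences of a unit from the printed percentages are to be expected because the trends are used here in rounded form.

```python
years    = list(range(1929, 1942))
raw      = [244, 236, 200, 140, 114, 125, 151, 209, 292, 277, 259, 278, 317]
straight = [161, 171, 180, 190, 200, 209, 219, 228, 238, 247, 257, 266, 276]
third    = [274, 206, 163, 143, 141, 152, 174, 203, 233, 263, 286, 301, 302]

# Cycle: third-degree trend as a percentage of the straight-line trend.
cycle = [100.0 * c / s for c, s in zip(third, straight)]
# Cycle mixed with residuals: raw data as a percentage of the same trend.
cycle_mixed = [100.0 * y / s for y, s in zip(raw, straight)]

for year, c, m in zip(years, cycle, cycle_mixed):
    print(year, round(c), round(m))
```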

¹ In addition, there might be minor cyclical movement, a fact that could
be determined by further analysis of data extending over a longer period of
time.

² See p. 570 for meaning of “residual fluctuations” in time series. In
this instance, the residuals might include short-cycle fluctuations. See
also Chap. XXV, pp. 659-661.

Determination of Cycle in Monthly Data. Cycle Determined
by Adjusting Monthly Data. Monthly data, when examined to
discover the cycle, must be adjusted not only for trend but also
for seasonal variations. Adjusting monthly data for trend
and seasonal variation in order to measure cyclical movements
involves the removal of trend and seasonal variation from the
raw monthly data. The method is devised to produce an index
or relative expression of raw data compared with what would be
expected if only trend and seasonal variation were present in the
data. Such analysis gives an answer to the question: How
far are the raw data from month to month above or below what
they would be if they were following the usual course of seasonal
variation and the expected trend? In effect, the process is the
reverse of that illustrated by a hypothetical series at the beginning
of Chap. XX. The theory underlying the method of
adjusting monthly data for seasonal variation and trend contains
no tricks new to the student after mastering the material
in the preceding chapters. It is quite simple in concept, though
perhaps the arithmetical calculations involved are somewhat long.

TABLE 101.—WORK SHEET FOR CALCULATING MONTHLY INDEX OF CYCLE
DESCRIPTION OF DATA: Consumer installment-sale debt for purchase of
household appliances, United States

SOURCE OF DATA: Survey of Current Business, Vol. 22 (1942), pp. 9-25, 17.

      (1)          (2)         (3)        (4)           (5)           (6)
Year and month   Raw data,   Monthly    Index of      Trend times   Cycle,
                 millions    trend,*    seasonal      seasonal      per cent
                 of dollars  y'         variation,†   variation,    (2) ÷ (5)
                                        per cent,     (3) × (4)
                                        Av. = 100

1940:
  January         262       262.0       96.2         252.0        104.0
  February        255       262.8       94.0         247.0        103.2
  March           253       263.6       92.5         243.8        103.8
  April           259       264.4       95.2         251.7        102.9
  May             271       265.2       97.6         258.8        104.7
  June            281       266.0      103.2         274.5        102.4
  July            288       266.8      104.7         279.3        103.1
  August          294       267.6      107.6         287.9        102.1
  September       293       268.4      105.0         281.8        104.0
  October         290       269.1      102.2         275.0        105.4
  November        290       269.9      100.6         271.5        106.8
  December        302       270.7      101.1         273.7        110.3
1941:
  January         290       271.5                    261.2        111.0
  February        286       272.3                    256.0        111.7
  March           286       273.1                    252.6        113.2
  April           303       273.9                    260.8        116.2
  May             320       274.7                    268.1        119.4
  June            330       275.5                    284.3        116.1
  July            336       276.3                    289.3        116.1
  August          346       277.1                    298.2        116.0
  September       342       277.9                    291.8        117.2
  October         333       278.7                    284.8        116.9
  November        320       279.5                    281.2        113.8
  December        313       280.3                    283.4        110.4
1942:
  January         294       281.1                    270.4        108.7
  February        285       281.9                    265.0        107.5
  March           272       282.7                    261.5        104.0
  April           258       283.5                    269.9         95.6
  May             241       284.3                    277.5         86.8
  June            219       285.1                    294.2         74.4
  July            202       285.9                                  67.5
  August          183       286.7                                  59.3
  September       169       287.4                                  56.0
  October
  November
  December

* Equation of monthly trend: y' = 219.0 + 0.796t (origin July, 1935).
† Necessary to copy this for only one year; this seasonal variation was calculated for
illustrative purposes in Chap. XXIII, pp. 625-631.

Illustration of Method of Determining Cycle in Monthly Data. 
It should be pointed out at the start that measuring the cycle 
in monthly data is measuring the same cycle as was measured 
in the annual data, if the same trend is used. In this illustration 
the raw monthly data will be adjusted for the straight-line trend 
and for seasonal variation, so that the resulting series of monthly 
data will correspond to the figures given in the last column of 
Table 100, except that they will be monthly instead of annual 
figures. The annual averages of the monthly adjusted data
obtained in this illustration should be equal to the annual 
percentages shown in the last column of Table 100. 
Table 101 is a work sheet drawn up for the purpose of making 
the necessary calculations, on the assumption that (1) the index 
of seasonal variation and (2) the equation of trend on an annual 
basis have been calculated. The data used are monthly figures 
for consumer installment-sale debt for household appliances in the 
United States, 1929-1942. In the illustration only the period 
1940-1942 is presented. It would be a good laboratory exercise
for the student to work out the rest for himself and plot the 
resulting adjusted figures. 
Column (2) of Table 101 contains the raw monthly data
tabulated from the source.¹
If a trend equation is in terms of annual figures, and the
annual figures used are annual totals, the monthly trend equation
will be

$$y' = \frac{a}{12} + \frac{b}{144}\,t$$

But if the annual figures are annual averages of monthly data,
then the monthly equation of trend will be

$$y' = a + \frac{b}{12}\,t$$

¹ HOLTHAUSEN, op. cit.


Thus, b must be divided by 12 because in the monthly trend 
equation the annual increment is distributed among 12 parts 
(t now stands for months instead of years). In other words, if b 
is the annual increment, b/12 is the monthly increment. But if 
b is the annual increment of total annual data (sum of the 12 
months each year), then to put it on a monthly basis it is neces- 
sary first to convert it to a monthly figure by dividing by 12; 
it is then still an annual increment and has to be divided by 12 
again to obtain the monthly increment. 

In the trend equation, a can be assumed to be at June-July of
the origin year, and of course the origin may be shifted by changing
accordingly the value of a. The origin is at the middle of the
year, i.e., between June and July. For example, the equation
of trend found for the annual data on consumer installment-sale
debt for household appliances is (in millions of dollars)

$$y' = 218.6 + 9.53t \quad \text{(origin at 1935)}$$

The data used in this illustration are annual averages of monthly
data; so the monthly trend equation is (in millions of dollars)

$$y' = 218.6 + 0.796t \quad \text{(origin at June-July, 1935)}$$

in which the unit of t is 1 month.

By adding algebraically half a monthly increment to 218.6,
the origin is shifted to July, 1935, and the approximate equation
of monthly trend is as follows (in millions of dollars):

$$y' = 219.0 + 0.796t \quad \text{(origin at July, 1935)}$$

Solving this equation for different values of t (from t = 54 at
January, 1940, to t = 86 at September, 1942) gives the various 
monthly values of y’ shown in column (3) of Table 101 under the 
caption Monthly Trend. Column (4) shows the index of seasonal
variation, which in Chap. XXIII was calculated by the
12 months’ moving total method. Column (5) shows seasonal 
variation and trend combined by multiplication, and column (6) 
is obtained by dividing the monthly items in column (2), the 
raw data, by the monthly items in column (5). By this last 
operation, both trend and seasonal variation are removed from 
the raw data; the resulting index gives an idea of how high or low 
the raw data are in comparison to what they might be expected 
to be according to usual seasonal variation and trend. 
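
The 1940 rows of the work sheet can be recomputed with the following Python sketch. The helper names `months_from_origin` and `monthly_trend` are assumptions made for the example; the raw data, the seasonal index, and the trend equation y' = 219.0 + 0.796t are those given in the text. Because the trend is not rounded at each step here, individual results may differ from the printed column (6) in the last decimal place.

```python
# Raw data, column (2), January-December, 1940.
raw_1940 = [262, 255, 253, 259, 271, 281, 288, 294, 293, 290, 290, 302]
# Index of seasonal variation, column (4).
seasonal = [96.2, 94.0, 92.5, 95.2, 97.6, 103.2,
            104.7, 107.6, 105.0, 102.2, 100.6, 101.1]

def months_from_origin(year, month, origin=(1935, 7)):
    """Months elapsed from the trend origin (July, 1935) to a given month."""
    return (year - origin[0]) * 12 + (month - origin[1])

def monthly_trend(t):
    """Monthly trend from the text: y' = 219.0 + 0.796 t."""
    return 219.0 + 0.796 * t

cycle_1940 = []
for month, (y, s) in enumerate(zip(raw_1940, seasonal), start=1):
    t = months_from_origin(1940, month)          # 54 for January, 1940
    expected = monthly_trend(t) * s / 100.0      # column (5)
    cycle_1940.append(100.0 * y / expected)      # column (6)

print([round(c, 1) for c in cycle_1940])
print(round(sum(cycle_1940) / 12))               # annual average for 1940
```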

Data thus treated over a series of years disclose information 
about the time series that it is not possible to visualize from the 
raw figures. It makes possible the comparison between cyclical 
and minor cyclical fluctuations in time series otherwise con- 
cealed by disturbing elements of seasonal variation and trends. 
If this monthly analysis of the data on consumer installment-sale 
debt for household appliances in the United States were done 
for the entire period 1929-1942, the picture of monthly data 
would, of course, resemble the broken line of Fig. 151. The
annual averages of the monthly data, which contain cyclical 
movements mixed with residual movements, would be equal to 
the figures shown in the final column of Table 100. In this con- 
nection it is to be noted that the annual averages of column (6) 
in Table 101 are equal to the corresponding annual figures in the 
last column of Table 100; for 1940 the annual average of the 
figures in column (6) of Table 101 is equal to 104, and for 1941 
the annual average of the figures in column (6) of Table 101 
is equal to 114. 

From the results of calculations in Table 101 it may be con- 
cluded that consumer installment-sale debt for household 
appliances reached the peak of a cycle in May, 1941, remaining 
10 to 17 per cent above normal throughout 1941. In 1942 a 
sharp decline materialized; in fact, this decline, on a monthly 
basis, was rapid after October, 1941. The raw data appear to
indicate that the peak of the cycle occurred in August, but this is
due to the effect of seasonal variation and trend. When seasonal
variation and trend are taken into consideration, the
cyclical peak is found to be in May, 1941. From July, 1940, to
August, 1940, the raw data show an increase, but a cyclical
decline occurred in that period. Removal of seasonal variation
and trend makes it apparent that the appearance of a rise from
July, 1940, to August, 1940, was due to seasonal influences and
trend. 

Adjustment of data by removing seasonal variation and trend 
makes it possible to judge quickly whether or not consumer 
installment-sale debt for household appliances is rising (or falling) 
more rapidly than seasonal variation and trend would lead us to 
expect. The resulting figures are frequently described when 
published by saying that the data are “adjusted for seasonal 
variation and trend.” Sometimes, if trend is unimportant or 
of dubious character, only seasonal variation is removed and the 
data are described as “adjusted for seasonal variation.” Charts 
of such data appear frequently in financial publications and in 
the financial sections of metropolitan newspapers. 

Measuring the Cycle Where Trend Is a Second- or Third-degree 
Polynomial. The rational growth of some data is better described 
by a second-degree polynomial, as discovered in Chap. XXI. 
When such is the case, it would be necessary to use a second- 
degree polynomial instead of a straight-line trend as illustrated 
in the preceding sections of this chapter. 

A third-degree polynomial is not likely ever to resemble a 
rational growth element in a time series, but it may resemble the 
conformation of the major cycle during a specified period covered 
by data that are being analyzed. If it is desired not only to 
remove growth trend but also to remove from the data the effects 
of the major cycle in order to observe residuals that might be 
significantly described as a short cycle, the method described in 
the preceding sections could be used with monthly data, apply- 
ing the same principles to the removal of a third-degree poly- 
nomial trend combined with seasonal variation that were applied 
to the removal of a straight-line trend combined with seasonal 
variation. 

The general form of the second-degree polynomial annual
trend is y' = a + bt + ct², where t is 1 year. The equation of the
monthly second-degree polynomial trend, where the annual
data are annual averages of monthly data, would be


y' = a + (b/12)t + (c/144)t²


where t is 1 month.

The general form of the third-degree polynomial annual trend
is y' = a + bt + ct² + dt³, where t is 1 year. The equation of
the monthly third-degree polynomial trend, where the annual
data are annual averages of monthly data, would be


y' = a + (b/12)t + (c/144)t² + (d/1728)t³


in which t is 1 month.

If the data are annual totals, instead of annual averages, 
every item on the right side of the respective equations will be 
divided by 12.* 
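
The conversion from annual to monthly coefficients may be sketched as follows. The function name and the sample coefficients are assumptions made for illustration. The sketch covers only the change of time unit from years to months; it leaves the origin of t where the annual equation put it, so any shift of origin is a separate step not shown here.

```python
def monthly_trend_coefficients(annual_coeffs, annual_totals=False):
    """Convert annual polynomial trend coefficients to monthly ones.

    annual_coeffs -- [a, b, c, ...] for y' = a + b*t + c*t**2 + ..., t in years
    annual_totals -- True if the annual data are totals of monthly data,
                     False if they are annual averages of monthly data
    The coefficient of t**k is divided by 12**k, because a step of one
    month is one-twelfth of a year. For annual totals every term is
    divided by a further 12.
    """
    monthly = [coef / 12.0 ** k for k, coef in enumerate(annual_coeffs)]
    if annual_totals:
        monthly = [coef / 12.0 for coef in monthly]
    return monthly


# Hypothetical third-degree annual trend fitted to annual averages.
print(monthly_trend_coefficients([100.0, 24.0, 1.44, 1.728]))
```

The divisors 12, 144, and 1728 in the monthly equations are simply the first, second, and third powers of 12.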

Danger in Extrapolating Trends. In the illustration on page 
641 it was noted that the second-degree polynomial trend, if 
extended beyond the year 1941, would quickly go up to infinity. 
Thus the extrapolation, or extension, of this trend beyond 1942 
very soon becomes an absurdity. This shows the need for 
caution in the projection, or extrapolation, of empirical trends.1
Their projection for short periods of time (how long depends 
upon the conditions of each particular case) is a valuable aid in 
constructing barometric indexes. 

A troublesome unsolved problem in time-series analysis is to 
know when trend is changing and also, for that matter, when 
seasonal variation may be changing. Neither the statisticians 
nor the economists have solved this problem, but they realize 
that it is ever present in time-series analysis. It is desirable, 
therefore, to be cautious about extending empirical trends into 
the future and to reexamine monthly data for seasonal variation 
at frequent intervals. A method for detecting changing trends 
in seasonal variation was explained and illustrated in the pre- 
ceding chapter. 

Method of Ratios vs. Method of Differences. In general, the 
method here presented for removing one or more types of varia- 
tion from time series has been the method of division, or ratios. 
In other words, the raw data are expressed as percentages of 
computed trend and seasonal variation. This is not the only 
method of removing trend and seasonal variation from the 
monthly or annual raw data. Another type of approach is called 

* See pp. 644-645. 


1 For further discussion in connection with economic forecasting, see
Chap. XXV, pp. 661-671.
the “method of differences,” which, in one of a number of forms,
may be summarized as follows: 

1. Assuming that the index of seasonal variation and the 
monthly trend have been calculated by one of the conventional 
procedures, the monthly trend values are multiplied by the 
index of seasonal variation. 

2. The values of trend multiplied by seasonal variation (y') are now
subtracted from the raw data (y).

3. This gives a series y₁ − y'₁, y₂ − y'₂, . . . , yₙ − y'ₙ, being the
arithmetical amount, in original units (pounds, dollars, etc.),
by which the raw data are greater each month, or less, than the
computed value for trend multiplied by the index of seasonal
variation. These residuals of the raw data from trend and
seasonal variation form a series r₁, r₂, r₃, . . . , rₙ that would be
a time series fluctuating arithmetically above and below zero, 
according to whether the raw data were above or below trend 
multiplied by seasonal variation, that is to say, according to 
whether the raw data were greater or less than values expected 
in view of the anticipated growth and seasonal fluctuation. 

4. These residuals are in terms of the quantity units of the
raw data. Consequently, it would be very difficult to compare 
the residual fluctuations of a series measured in bushels (say 
wheat production) with the residuals in dollars (say the price of 
wheat). It is necessary to find a common denominator in order 
to compare the residuals in various time series, obtained by the 
arithmetical difference method. 

5. The common denominator used is the standard deviation
of the residuals. σᵣ will be simply √(Σr²/N), since their arithmetical
average is zero. Each r divided successively by σᵣ would
give a series in terms of standard-deviation units that could
thereafter be compared with other time series similarly treated.
Various series, whether the original units were dollars, pounds,
inches, etc., will now be reduced to terms of standard-deviation
units and can all be plotted on the same scale, namely, a scale 
that is calibrated in standard deviations. 
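
The five steps above may be sketched as follows. The function name and the four monthly figures are invented for illustration. The sketch follows the text in taking the standard deviation of the residuals as the square root of Σr²/N, that is, on the assumption that their average is zero.

```python
import math


def standardized_residuals(raw, trend, seasonal_index):
    """Method of differences: residuals in standard-deviation units.

    raw            -- raw data in original units
    trend          -- trend values in the same units
    seasonal_index -- index of seasonal variation, 100 = average month
    Returns (standardized residuals, standard deviation of residuals).
    """
    # Step 1: trend multiplied by the index of seasonal variation.
    expected = [t * s / 100.0 for t, s in zip(trend, seasonal_index)]
    # Steps 2 and 3: residuals in original units.
    residuals = [y - e for y, e in zip(raw, expected)]
    # Step 5: standard deviation, assuming the residuals average zero.
    sigma_r = math.sqrt(sum(r * r for r in residuals) / len(residuals))
    return [r / sigma_r for r in residuals], sigma_r


# Hypothetical data: four months, level trend, no seasonal movement.
z, sigma_r = standardized_residuals(
    [104.0, 96.0, 103.0, 97.0], [100.0] * 4, [100.0] * 4
)
print(round(sigma_r, 2), [round(v, 2) for v in z])
```

Two series treated in this way, one in bushels and one in dollars, are both expressed in standard-deviation units and may be plotted on the same scale.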

One important disadvantage in the method of differences 
persists even after the residuals are expressed in terms of their 
own standard deviations. The residuals will tend to be arithmetically
greater when trend is at high values and arithmetically
small when trend is at low values. This means that the impression
will almost universally arise from such an analysis that as
things grow they become subject to more violent fluctuations.
In fact, when considered relative to the magnitudes from which
variation occurs, these greater arithmetical variations may be 
really less important. The ratio method places at all times the 
proper proportional emphasis upon arithmetical fluctuations by 
expressing them as a ratio to the trend and seasonal variation. 

On the other hand, by the same token the ratio method may 
tend to minimize the importance of fluctuations; for it may be
possible that the proportional amount of change is not so sig- 
nificant as the actual amount of change. For example, the fact 
that the amount of unemployment is no greater proportionally 
may not necessarily dispose of the fact that the actual amount of 
unemployment at some particular time is very great and the 
corresponding personal problems distressing in the extreme. 

Whatever method of statistics is used, it is necessary for the 
analyst to keep his eyes open to the effect the method itself may 
have upon his results. 


PART VI 


Forecasting 


CHAPTER XXV 
THE ART OF FORECASTING WITH STATISTICS 


INTRODUCTION 


Prevalence of Forecasts. Ancient Origin of Pseudoscientific 
Forecasts. The human desire to look into the future led, even 
in ancient times, to the rise of various forms of pseudoscientific 
forecasts. Oracles were frequently consulted as to the outcome 
of a contemplated military campaign, business venture, or love 
affair. Among the most famous of these was the Delphian 
oracle. Astrologists were, and still are, consulted for what the 
stars have to say; one of their most prominent devotees in 
modern times is said to be Adolf Hitler. 

It was partly to disprove some of these astrological notions 
that statistical method was first undertaken on a scientific 
basis. In the seventeenth century an idea prevailed that the 
phases of the moon influenced health; also, health was supposed 
to be critical every seventh year and life particularly hazardous 
at the ages of forty-nine and sixty-three. Near the end of the 
seventeenth century studies of vital statistics by Capt. John 
Graunt of London and Casper Neumann of Germany disproved 
the connection between health and the phases of the moon as 
well as the fateful significance of every seventh year in life. 
Other similar superstitions were “debunked” by statistical 
studies. From the beginning of the history of the modern 
money market, attempts have been made to devise some way 
to forecast the course of financial affairs. For the Antwerp 
Bourse in 1543, Christopher Kurz is said to have contrived an 
astronomical method of making prophecies about the money 
market.1

1 EHRENBERG, R., Capital and Finance in the Age of the Renaissance,

Modern Scientific Forecasts. Although forecasting was thus
once the special prerogative of soothsayers, today it has been 
placed upon a broader basis by the development of science. 
For one of the objectives of science is, precisely, to forecast. 
Science seeks to classify and determine relationships that may be 
used for purposes of prediction. Every scientific law is, in a cer- 
tain sense, a forecast. It foretells what will happen under certain 
circumstances. The law of gravitation says, for example, that
if a ball is dropped from a tall building it will fall with an acceleration
of about 32 feet per second per second. The gas laws say that
the pressure of a gas in a given container varies, and will vary, directly
with the absolute temperature and inversely with the volume. Scientific
astronomy makes it possible to forecast the tides, to construct the 
calendar for our mundane affairs, and, in addition, to forecast 
celestial events such as the date of the next visit of Halley’s 
comet. There are no “ifs” or “buts” about the modern scientific 
forecasts in the realm of the natural or physical sciences. 

Popular Dramatization of Forecasts. The depression of the 
1930’s did more than hundreds of books could have done to make 
people cycle-conscious. So general was the interest in cyclical 
behavior that by 1940 the Foundation for the Study of Cycles 
was set up as a nonprofit organization with an international 
committee composed of scientists and businessmen. This 
foundation proposed to help in the task of integrating the work 
of the thousands of scientists and statisticians who are con- 
tributing in various fields to the study of cycles. Not only have 
cycles been found to exist in the realm of business activity, but 
scientists in many other fields believe they have discovered 
cyclical behavior in their respective studies. For example, 
psychologists have discovered that human beings have regular 
ups and downs in their emotional life, following a cyclical 
pattern. Biologists have discovered what appear to be regular 
fluctuations in animal, insect, bird, and even fish populations. 
In 1937, Prof. William Hamilton of Cornell University, upon the 
basis of cycle studies, warned farmers and housewives of New
York State to prepare for a scourge of mice in the winter of 
1939-1940 and for another outbreak in 1943-1944. While it 
may still be too early to put the stamp of final scientific approval 
upon all these cyclical discoveries, they are nevertheless making 
important contributions to knowledge.

Some of the twentieth-century discoveries sound almost like
the pseudoscientific superstitions of the Middle Ages. Thus in
1943 the public was advised “to look for skunks under your
front porch about 1945.” It was claimed that an answer could
be given to such questions as: Will you feel happy or gloomy a
month from today? Such statements were made as: If you
are born in January, February, March, or April, the chances
are you will live longer than people born in July, August, or
September.1 These notions about forecasting suggest a precision
in statistical forecasts that they probably will never possess.2

Conditional Scientific Forecasts. A forecast, to be scientific, 
does not have to be unconditional; in fact, most forecasts in the 
realm of the social sciences and some in the realm of the physical 
sciences are hypothetical in character. Indeed, in its largest 
sense, forecasting must be taken to mean prediction of not only 
what will happen but what would happen under given hypothetical
conditions. Not only must the predictions of the
meteorologist and stockbroker be considered forecasts, but so
also must the predictions of the engineer as to the outcome of
certain plans and the warnings of the economist as to the effect
of certain proposed actions of Congress. The latter are
conditional forecasts.

Many predictions of coming events are hedged in by all sorts 
of weasel-like conditions. It may be said that private enterprise 
will disappear if Republicans are not elected. Or an economist 
may predict that a Congressional increase in tariff rates will 
cause exports to decline, provided that foreign countries do not 
offset our higher tariff by giving bounties to their exporters, or 
that foreign demand for American products does not increase 
for some unforeseen reason, or that American exports do not
become less costly to produce. Such forecasts are conditional, or 
hypothetical, in character. 

The practical worth of a forecast depends, not on whether
it is conditional or unconditional, but on how much knowledge 
the forecaster actually has of the relevant conditions. An uncon- 
ditional forecast may be merely a wild guess and have little 


1 DEWEY, EDWARD RUSSELL, “Science Predicts the Future,” American
Magazine, Vol. 136 (1943), pp. 90-92.

2 See More Exact Forecasting and Less Exact Forecasting, pp. 659-661.
See also SMITH and DUNCAN, Sampling Statistics and Applications.

value, in spite of its uncompromising and categorical appearance,
while a carefully drawn conditional forecast may be of great value,
in spite of its “pussyfooting” aspect. In the case of the latter, it
may be that the likelihood of the conditional factors is very 
slight and that they are mentioned only to guard the forecaster 
from unwarranted criticism. On the other hand, if the disturb- 
ing factor has a fair likelihood of occurrence, the nature of its 
effect might be forecast so that the recipient of the forecast 
could be on his guard against this factor; by watching it, he 
might know when to abandon his faith in the original forecast. 
For example, if a prediction of rain tomorrow is based merely 
on the fact that it looks somewhat cloudy today, the forecast 
would probably be of little value (in the sense that such forecasts 
would probably be wrong more often than they were right). 
In contrast, if a trained observer predicted rain after a thorough
observation of the weather situation, this would have considerable
value even if he hedged his prediction by saying that the
rain might not occur if the wind in a neighboring area shifted 
before a certain time. 

Qualitative vs. Quantitative Forecasts. Most forecasts are 
qualitative in character. The meteorologist says it will rain 
but does not always say how heavily. The economist may 
predict that the effect of an increase in the tariff will be to raise 
prices, but he does not often say to what degree. The meteorol- 
ogist, on the contrary, may give the approximate time when rain 
is expected and how many inches are expected to fall; and the 
economist may try to estimate the average foreseen rise in prices. 
The latter would be quantitative forecasts. 

It will be noted that forecasts may be quantitative in two 
ways, with reference to the degree of the predicted change and 
with reference to the time of occurrence. The success of fore- 
casting must be judged, not only on the basis of whether the 
forecast was correct, but also on how far the forecast went in 
actually describing the future event—its quantity and its timing. 

Illustrations of Modern Forecasts. In the modern world, 
forecasts are applied in many fields. Predictions of astronomical 
events, as already indicated, have been among the earliest and 
most successful forecasts. The movements of the moon, the 
planets, and other heavenly bodies have been computed with 
considerable accuracy so that their future course may be predicted
with great precision. Forecasts of certain eclipses,
for example, have been only a few seconds in error in timing. 
In this connection, it is interesting to note that the theory of 
least squares was largely developed in the attempt to forecast 
the paths of the heavenly bodies. 

Closely akin to astronomical forecasts have been forecasts 
of weather conditions. Short-range forecasts are based mainly 
on wind conditions and barometric pressures, but long-range
forecasts are sometimes attempted from the study of rainfall 
data, sunspots, and the like. In some instances, studies of 
growth rings in old trees have yielded weather data going back 
many years. These studies usually look for cyclical fluctuations
that will indicate periods of high and low activity and permit 
long-range forecasting. Studies of average weather conditions 
and the dispersion around these averages also afford forecasts 
of the variability of conditions in different areas and hence 
suggest the more desirable airports and air routes. 

Engineers make many forecasts. A water-power engineer 
will forecast the amount of power to be obtained from a dam of 
given size built in a given river. Another engineer may predict 
the breaking strength of given kinds of wire at different tempera- 
tures. Still another may predict the maximum load to be 
sustained by a given bridge.

From the laws of Mendel, biologists make predictions of 
results to be expected from crossbreeding. Agronomists will 
predict the average results to be obtained from the use of certain 
fertilizers, or certain methods of cultivation, or certain varieties 
of crops. Agricultural economists attempt to predict the 
effect of certain sized crops on the future prices of important 
commodities or the effect of certain prices on the future volume 
of production. 

Business economists attempt many kinds of forecasts. From 
studies of factors closely related to the sale of a certain product
in a given area where the trade has been well established, fore- 
casts may be made of the sales to be obtained from new untapped 
areas of similar character. Other economic forecasts aim to 
predict the ups and downs of the business cycle in various lines 
of activity. Probably the greatest percentage of economic fore- 
casts are devoted to predictions of the stock market—money 
rates, bond prices, and security prices.

These are but a few of the examples of forecasting. It is
probably true that forecasting in all its ramifications is pandemic 
in modern life. 

Use of Statistics in Forecasting. This chapter attempts to 
outline the use that may be made of statistical analysis in making 
forecasts. Details of the methods of forecasting are beyond the 
scope of this volume, which is not a book on forecasting but 
merely includes a chapter on the pattern of methods used in fore- 
casting. A few examples will be given to illustrate these 
methods. The aim here is primarily to indicate the application
of different statistical techniques to the problems of forecasting. 

Statistics affords a basis for forecasting in two principal 
ways. (1) By studying monovariate and multivariate frequency 
distributions, statistics are used to forecast average results and 
the type and degree of dispersion around these results. (2) By 
means of time-series analysis, statistics are used to predict the 
course of events in time. Each of these applications to fore- 
casting will be discussed in the ensuing pages. 


FORECASTS FROM DISTRIBUTION STUDIES


Forecasts from Monovariate Distributions. If considerable 
data have been obtained, forecasts from monovariate distribu- 
tions may yield good estimates of the mean, standard deviation, 
coefficient of skewness, and kurtosis of the population from 
which the data were derived. If such is the case, these popula- 
tion estimates may be used to forecast the character of future 
samples drawn from the given population. 

Familiar matters relating to family care and health conven- 
tionally rely upon forecasting by use of monovariates. Suppose 
the frequency distribution, or monovariate, represents the 
weights of boys of specified age. The mean of that distribution 
is presumably normal weight for that age; the standard devia- 
tion and kurtosis describe expected variability. From such a 
monovariate and its statistics, it is commonly inferred whether or 
not a child is under normal weight and, if so, whether or not this 
deficiency is sufficient to cause alarm. Taken with other 
evidence it may be the basis for the application of timely therapeutic
action.
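
The inference described above may be sketched as a comparison of a child's weight with the mean in standard-deviation units. The mean, the standard deviation, the sample weight, and the cutoff of two standard deviations below the mean are all assumptions chosen for illustration; they are not taken from any table in this book, and kurtosis is left out of the sketch.

```python
def standard_score(value, mean, sd):
    """Deviation of a value from the mean, in standard-deviation units."""
    return (value - mean) / sd


def weight_assessment(weight, mean, sd, alarm_below=-2.0):
    """Classify a weight against the distribution for the child's age.

    alarm_below is a hypothetical cutoff in standard-deviation units.
    """
    z = standard_score(weight, mean, sd)
    if z >= 0:
        return "at or above normal weight"
    if z > alarm_below:
        return "under normal weight"
    return "under normal weight, enough to warrant attention"


# Hypothetical distribution: mean 80 pounds, standard deviation 8 pounds.
print(weight_assessment(62.0, mean=80.0, sd=8.0))
```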

In social control, monovariate distributions are used to 
standardize products involving the presumption of forecasting.
The fat content and standard deviation in fat content of milk
that has been produced and sold in the past constitute a set of 
standards according to which it is ruled that milk sold in the 
future will conform; thus milk is graded according to standards 
found by frequency-distribution analysis. Methods of weight 
or content are used by the Bureau of Standards to set standards 
for many products, both in the raw state, such as grains of wheat, 
and in the final product, such as bread; and abnormal variations 
from these standards in the market are not permitted. 

In business the use of monovariate distributions for fore- 
casting is widespread. For example, the distribution in sizes 
of shoes sold by a retailer is used as the basis for forecasting his 
future sales and for determining his reorders of additional shoes. 
In such forecasting, the businessman is interested in forecasting 
each class in the distribution rather than in the distribution’s 
average and standard deviation. A similar forecasting pro- 
cedure is used by any retailer when he purchases articles that 
are sold by size, which include most articles of clothing. The 
wholesaler and the manufacturer also are interested in the same 
type of forecasting, so that they may profit by having the 
appropriate number of articles of various sizes continually 
ready for the consumer—if the articles are there, ready for him 
to buy when he comes, a minimum of consumers’ sales will be 
lost.1
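
A retailer's forecast of each class in the distribution may be sketched as follows, assuming that future sales will be spread over the sizes in the same proportions as past sales. The sizes, the past sales, and the expected total are invented for illustration.

```python
def forecast_by_size(past_sales, expected_total):
    """Spread an expected total over sizes in proportion to past sales.

    past_sales     -- mapping of size to number of pairs sold in the past
    expected_total -- total number of pairs expected to be sold
    """
    total = sum(past_sales.values())
    return {size: expected_total * sold / total
            for size, sold in past_sales.items()}


# Hypothetical past sales of three sizes; 200 pairs expected next season.
print(forecast_by_size({7: 20, 8: 50, 9: 30}, expected_total=200))
```

Here the forecast is made class by class; the average size and its standard deviation play no part in it.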

Forecasts from Bivariate Distributions. Bivariate data may 
likewise yield estimates of a bivariate population that may make 
it possible to forecast results of future samples. Suppose, for
example, that it is found from a study of army records that there 
is a high correlation between the Army General Classification 
Test scores and the results obtained in a given electrical course. 
To be specific, suppose that the bivariate distribution of these 
two variables appears to be normal in form and it is estimated 


1 For another illustration of the use of monovariates in forecasting, see 
Robert J. Myers, “Comparison of Demographic Rates Assumed by the
National Resources Committee with Actual Experience,” Journal of the
American Statistical Association, Vol. 38 (1943), pp. 201-209; also, for an
example of such forecasting for the purpose of control of quality of
manufactured product, see William B. Rice, “Quality Control Applies to
Business Administration,” Journal of the American Statistical Association,
Vol. 38 (1943), pp. 228-232; cf. W. A. Shewhart, Economic Control of 
Quality of Manufactured Product (1931).

that the line of regression of electrical grades on Army General
Classification Test scores is


E = 5 + 0.8T


with a first-order standard deviation of seven points. From 
this it is possible to make certain predictions regarding the future 
results of men taking the electrical course. Thus, it might be 
predicted that men who got 90 on the Army General Classification
Test will get an average grade of 77 on the electrical
course, half of them getting less than this and half getting more. 
If 70 is taken as the passing grade, it may be predicted that 
something like 84 per cent of the men whose score is 90 will 
pass the course, 70 being one standard deviation less than the 
average and the distribution being normal.1 
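
The two predictions in this paragraph may be checked with a short sketch. It takes the line of regression to be E = 5 + 0.8T, which reproduces the average grade of 77 quoted for a test score of 90, and it assumes a normal distribution of grades about the line with a standard deviation of seven points. The function names are assumptions made for the example.

```python
from statistics import NormalDist


def predicted_grade(test_score, intercept=5.0, slope=0.8):
    """Average electrical-course grade predicted from the test score."""
    return intercept + slope * test_score


def probability_of_passing(test_score, passing_grade=70.0, sd=7.0):
    """Chance of reaching the passing grade, grades being normal about the line."""
    mean = predicted_grade(test_score)
    return 1.0 - NormalDist(mu=mean, sigma=sd).cdf(passing_grade)


print(predicted_grade(90))
print(round(probability_of_passing(90), 2))
```

The passing grade of 70 lies one standard deviation below the predicted average of 77, and the area of the normal curve above that point is about 84 per cent.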

Forecasts from Multivariate Distributions. Study of multi- 
variate distributions may permit the same sort of forecasting as 
the study of bivariates. Suppose that a study of army salvage 
data shows that the length of service of wool socks is closely 
related to the amount of marching required of the troops and the 
average temperature of the area. Suppose, for example, that 
the plane of regression derived from the data is as follows: 


Length of life = 40 days — 3 (average miles marched per day) 
+ 2 (average temperature) 


Then, if the average miles marched is increased by 2 miles per 
day, the Army may predict that the average length of life of 
wool socks will probably decrease by 6 days. Or again, if the 
troops are shipped to an area in which the average temperature 
is 10 degrees warmer, the average length of life of the wool socks 
will be increased by 20 days. Or, in a third instance, if it is
planned to send a force to a given area for which the average 
temperature is approximately 60 degrees and it is expected that 
the troops will march about 20 miles per day on the average, 

1 For illustrations of the use of bivariates for prediction in business, see 
Patricia Daly and Paul H. Douglas, “The Production Function for Canadian
Manufactures,” Journal of the American Statistical Association, Vol.
38 (1943), pp. 178-186; also see pp. 674-675. 

2 The correlation between length of life and temperature is assumed to 
be positive on the ground that, the warmer the climate, the less the use 
that would be made of wool and the more the use that would be made of 
cotton socks. See the discussion on planes of regression, pp. 448-455.

then it should have sufficient supplies of wool socks on hand to
provide for new issues every 100 days on the average. If the 
study indicated that the first-order standard deviation about the 
plane of regression was about 5 days, the Army might keep on 
hand a large enough supply to replace socks every 85 days (that 
is to say, 100 minus three times the standard deviation, or 
100 — 15 = 85). 
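
The forecasts drawn from the plane of regression may be sketched as follows, taking the plane exactly as written above and a first-order standard deviation of 5 days. The function names are assumptions made for the example.

```python
def sock_life_days(miles_per_day, avg_temperature):
    """Average length of life of wool socks, from the plane of regression."""
    return 40.0 - 3.0 * miles_per_day + 2.0 * avg_temperature


def conservative_replacement_interval(miles_per_day, avg_temperature,
                                      sd_days=5.0, k=3.0):
    """Predicted life less k standard deviations about the plane."""
    return sock_life_days(miles_per_day, avg_temperature) - k * sd_days


# Troops marching 20 miles per day in an area averaging 60 degrees.
print(sock_life_days(20, 60))
print(conservative_replacement_interval(20, 60))

# Effect of marching 2 more miles per day, temperature unchanged.
print(sock_life_days(22, 60) - sock_life_days(20, 60))
```

Each coefficient of the plane gives the change in average length of life for a unit change in its own variable, the other variable being held constant.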

Errors of Forecasts. In concluding this section on the use 
of monovariates, bivariates, and multivariates as forecasters, it 
must be noted that forecasts of the kind indicated are necessarily 
inexact. They are based on the assumption that the population
is exactly known. When the population characteristics them- 
selves have to be estimated, as they usually do, then the fore- 
casts based upon these estimates will suffer from all the errors 
involved in the latter. The more refined analysis that is required 
to take care of these errors of estimate of the population is 
beyond the scope of this book. It is sufficient here to point out 
that the error of forecast based upon estimated population char- 
acteristics is greater than that based upon a known population. 
For example, if a plane of regression based upon sample estimates 
has a related standard deviation of five points, the probability of 
a forecast based on the plane being off by as much as two times 
five points in either direction (therefore, 2σ) will be, not the
normal 5 per cent, but perhaps 10 per cent or more. Every- 
thing depends on the size of the sample from which the original 
estimates of the population characteristics were made. 


FORECASTING TRENDS WITH TIME SERIES 


More Exact Forecasting. If much is known about a par- 
ticular time series, so that the nature of its growth and cyclical 
movements can be fairly well determined from rational con- 
siderations and if the remaining fluctuations are apparently 
random, forecasting from such a series can be put on much the
same level as forecasting from distributions of the monovariate, 
bivariate, and multivariate type discussed above. Careful 
estimates may be made of the growth, and these may be extra- 
polated for a short period of time into the future. Distribution 
analysis of the random fluctuations will determine the range of 


1 For the summer season this would be less than for the winter season, 
but a stock level based on a 100-day turnover might be taken as normal.

fluctuations around the growth curve and will afford an estimate
for the error of a forecast based solely on the growth element in 
the series. 

Suppose, for example, that a logistic form of growth appears to 
be very logical for a certain type of data. If the data have 
reached a certain stage of development, the values of the next 
few periods may be forecast from an extrapolation of the logistic 
curve fitted to the past data. The amount of error in the fore- 
cast resulting from the random fluctuations around the normal 
growth may be estimated from the standard deviation of the
residual fluctuations of the data from the fitted logistic curve. 
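
The procedure may be sketched as follows. The particular logistic form, the parameter values, and the band of two standard deviations are assumptions made for illustration; the curve is supposed to have been fitted already, and the fitting itself is not shown.

```python
import math


def logistic(t, ceiling, a, b):
    """One common logistic form: y = ceiling / (1 + exp(a - b*t))."""
    return ceiling / (1.0 + math.exp(a - b * t))


def forecast_with_band(observed, params, periods_ahead, k=2.0):
    """Extrapolate a fitted logistic curve, with a band of k residual
    standard deviations on either side.

    observed -- past data for periods 0, 1, 2, ...
    params   -- (ceiling, a, b) of the curve already fitted to the data
    Returns (residual standard deviation, list of (low, centre, high)).
    """
    n = len(observed)
    fitted = [logistic(t, *params) for t in range(n)]
    residuals = [y - f for y, f in zip(observed, fitted)]
    sd = math.sqrt(sum(r * r for r in residuals) / n)
    forecasts = []
    for t in range(n, n + periods_ahead):
        centre = logistic(t, *params)
        forecasts.append((centre - k * sd, centre, centre + k * sd))
    return sd, forecasts


# Hypothetical data scattered about a curve with ceiling 100.
params = (100.0, 2.0, 0.5)
observed = [13.0, 17.5, 28.0, 37.0, 51.0, 61.5]
sd, forecasts = forecast_with_band(observed, params, periods_ahead=2)
print(round(sd, 2), [tuple(round(v, 1) for v in f) for f in forecasts])
```

The band reflects only the random fluctuations about the curve; it makes no allowance for error in the fitted parameters themselves.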

Illustrations of the type of time series that would permit 
fairly exact forecasting are afforded by Fig. 54 in Chap. V and 
Figs. 144 and 145 in Chap. XXI. 

Less Exact Forecasting. The real difficulty in most time 
series analyses is to determine what is random and what is not 
random. Furthermore, it is often hard to work out any rational 
basis for specific forms of the trends and cycles. In cases 
where there is no particular trend indicated by the rationale of 
the situation, forecasts must be of a rough-and-ready sort, and 
little can be done to determine the error of forecast. 

Economic time series are generally of the sort that do not 
permit more exact statistical forecasts.2 For this reason statistical
analysis is usually only one of the elements entering into
the making of economic forecasts. In some cases it plays a more 
important role than others, but nearly always the forecaster 
incorporates his statistical findings into a general appraisal of 
the situation. As indicated above, statistical analysis in these 
cases is itself largely intelligent guessing. The statistical part 
of an economic forecast is consequently merely the quantitative 
ingredient of the final forecast. 

Public utilities, especially the telephone companies, are 
keenly interested in the subject of forecasting growth or trend 
elements in time series. In the telephone business the laying 


1 See discussion on rational vs. empirical trends, pp. 550-565. 

2 TINTNER, GERHARD, “The Analysis of Economic Time Series,” Journal
of the American Statistical Association, Vol. 35 (1940), pp. 93-101. WALLIS,
W. ALLEN, and GEOFFREY H. MOORE, “A Significance Test for Time Series
Analysis,” Journal of the American Statistical Association, Vol. 36 (1941),
pp. 401-409; Vol. 38 (1943), pp. 153-164.
out of plans and the construction of new exchanges necessitate
some sort of forecast as to the future growth of the community. 
For years these companies have maintained elaborate and 
efficient research organizations whose business it is to forecast 
trends in growth of population as well as the geographical 
distribution of various types of business and residential areas 
in the communities served. 

Most business enterprises, however, are more concerned 
about cyclical fluctuations than about trend or growth in time 
series. For this reason the greatest number of published fore- 
casts have to do with the prediction of cyclical movements in 
business conditions. 


FORECASTING CYCLES WITH TIME SERIES 


All that has been said about the inexactness of forecasting 
trends by the use of time series applies equally to the forecasting 
of cycles with time series. Nevertheless, the practice of relying 
upon statistics as an aid to business is now so prevalent that 
statistics, along with accounting, has become one of the standard 
tools and one of the essential means of internal control of nearly 
all economic enterprises, as well as a guide to public policies of 
governmental agencies. Among its many commercial uses, 
business forecasting is one of the most important, and it is along 
this line that statistical analysis has been intensively developed 
in recent years. Today there are several important agencies 
that supply forecasting services. Among these are Standard & 
Poor’s Corporation, Brookmire Economic Service, Moody’s
Investor’s Service, Babson, and the Harvard Economic Society.
In addition, many commercial banks such as National City,
Cleveland Trust, and Chase National include forecasts of prob- 
able future business trends in their monthly letters. Supple- 
menting these professional services are the statisticians and 
statistical departments of many large corporations, such as the
American Telephone and Telegraph Company, which make fore- 
casts for their own use. 

American activity in this field has been internationally con- 
tagious. As early as 1921 the publication of the Economic 
Bulletin of the Conjuncture Institute was begun in Moscow; this 
publication was devoted to the study of business cycles and 
to the analysis of Russian statistics. Subsequently in nearly 
every important European country during the 1920’s and
1930’s forecasting services were organized, sometimes by the 
large universities. The League of Nations showed its intention 
of inaugurating forecasting on a world scale by appointing a 
Committee of Experts on Economic Barometers.1

The many possible occasions when forecasting is required in 
modern business can be shown by a few examples. A com- 
mercial banker granting a loan must forecast the probability 
of its being repaid; his judgment in this respect will depend on 
his forecast of the borrower's future earning power; this, in turn, 
depends on his estimation of probable future price stability 
in the borrower’s business. Similarly, a collateral loan will 
involve a prediction, more or less precise, of the future value 
of the security offered as collateral. A manufacturer needs to 
forecast probable sales and probable prices of his own goods 
and of materials he has to purchase, so that he can profitably 
plan production and plant expansion. A public-utility operator 
needs to forecast population and industrial trends, construction 
and operating costs, and probable prices for the service, in 
order to determine when and where to build a railroad line, a gas 
main, a power plant, or a telephone exchange. 

All these things are commonplace in economic life, but the 
growing complexity and interdependence of economic society 
have made it increasingly difficult for the average businessman 
to comprehend an existing situation in trying to formulate his 
programs for the future. He is not a statistical expert. His 
knowledge of methods of summarization and comparison goes 
usually little beyond a vague comprehension of averages. To 
aid him, it is the purpose of the various business forecasting 
agencies “to provide the basis for business, financial, and security 
market policy. Regardless of the inevitable margin of error 
in every forecast, business, financial, or security market policy 
which is geared to only a fairly intelligent estimate of future 
probabilities is more likely to be sound than is policy geared 
only to guess, or to no forecast whatsoever.”2


1 Cox, G. V., An Appraisal of American Business Forecasts, pp. 1-2. 

2 “A Forecaster’s View of Forecasting,” Standard Statistics (June 15,
1931), p. 14. Also see BRATT, ELMER C., Business Cycles and Forecasting
(1941), pp. 736-800; HARDY, CHARLES O., and GARFIELD V. COX,
Forecasting Business Conditions (1927).
Forecasting General Business Conditions. One of the most
important objects of economic forecasting is to predict general
business conditions, that is to say, the cyclical position of general 
business. Good times and bad times are such important 
elements in determining the prosperity of individual lines of 
activity and of individual firms that the prospect for general 
business is probably the first thing any corporation executive 
wishes to know. Statistically, general business is properly 
measured by some index of all business activity. It is the sum- 
mation of the whole and not merely one of the parts, although 
an index of a part, say an index of industrial production, may be 
taken as a barometer of the upswings and downswings of the 
whole. Such series are commonly called “business barometers.”1

Forecasts of general business conditions are based upon one 
of two forecasting methods or a combination of the two. The 
first method is known as “historical analogy,” the second as
“crosscut analysis.”2

The method of historical analogy is based on the assumption 
that in cyclical fluctuations history tends to repeat itself. In 
its cruder forms, this consists merely in forecasting the course of 
general business, subsequent to some disturbance, from the 
course of general business that followed a similar disturbance 
in the past. For example, the forecaster might undertake to 
predict the course of general business following the crisis of 1929 
from the course of business following the crisis of 1873. In 
more carefully thought-out form, historical analogy becomes a 
business-cycle theory that attempts to explain how the interplay 
of economic forces causes general business now to rise and now 
to fall. 

Crosscut analysis proceeds on the basis that the business
situation is never the same and that each new upswing or
downswing is the product of a set of factors different from those
previously operative. To understand the given situation all
the factors must be weighed as to their importance and a net 
appraisal of the situation derived. 


1 See pp. 530-535. 

2 For more elaborate classifications see Bratt, op. cit., pp. 736-760;
Haney, L. H., Business Forecasting (1931), p. 195; Day, E. E., “The Rôle of
Statistics in Business Forecasting,” Journal of the American Statistical
Association, Vol. 33 (1938), p. 2. 
In good forecasting, both methods are employed. If a certain
cyclical theory appears to constitute a good explanation of past 
events, it is good forecasting practice to consider it in predicting 
future cycle changes. Nevertheless, careful study must be 
made to see whether the role played in the past by a particular 
industry or economic development is subsequently being played 
by some other industry or development. The user of historical 
analogy must always, therefore, be on guard for changes required 
in the statistical embodiment of the cyclical theory on which his 
analysis is based in order to keep it up to date in its assumptions. 
During the railroad era, statistics about railroads dominated 
the scene as good indicators of general business conditions; 
later, it was statistics about automobile production; perhaps 
the time will come when it will be aircraft production. Again, 
the present era is often regarded as the “iron age.” Statistics
of iron and steel production are often used as barometers of 
general business conditions because so many of the products 
of the modern age depend upon iron. Perhaps the time will 
come when the emphasis will shift, from the point of view of 
statistics, away from iron and steel production to the production 
of the lighter metals such as aluminum. Who can say when the 
world of business is changing from the one to the other? 

Reflection along the lines indicated in the preceding paragraph 
leads to the conclusion that continuous crosscut analysis is 
needed as a means of verifying and justifying the use of the 
historical-analogy method. 

Forecasting by Historical Analogy. One type of forecasting by 
historical analogy makes extensive use of the fluctuations in
particular time series that appear to lead general business con- 
ditions. Examples of series that have been used as business 
barometers are indexes of stock-market prices, changes in unfilled 
orders of the United States Steel Corporation, machine-tool 
orders, and the loan-deposit ratio. These series, it is argued, will
tend to lead changes in general business conditions, and important 
changes in general business conditions will first be made apparent 
by them. For example, a clear and consistent downswing in 
unfilled steel orders is presumed to presage a similar downswing 
in general business. Hence the latter is presumably forecasted 
from the former. In the case of the loan-deposit ratio, it is the 
attainment of certain critical levels that is significant; when high 
levels are reached, for example, (i.e., when loans are high relative
to deposits) strained credit conditions are in evidence and a 
crisis will be forecasted. 

More elaborate analyses making use of the historical analogy
for forecasting combine several economic series. A well-known 
example of such a combination is that prepared by the Harvard 
Economic Society and published in the Review of Economic 
Statistics. While the society itself makes no forecasts from its 
statistical series, they have been found useful for such purposes 
and it is generally understood that that is what they are published 
for. These are shown in Fig. 152. 

The Harvard series consist of three curves, known as the 
A, B, and C curves. The A curve represents speculation, the 
B curve business, and the C curve money. The actual data 
upon which these curves have been based vary from time to 
time. In those shown in Fig. 152, the curves are constituted as 
follows:1 The A curve, speculation, is based on the price of all
securities listed on the New York Stock Exchange. The B 
curve, business, is based on bank debits in 241 cities outside 
of New York City. The C curve, money, is based on short-term 
money rates. In each of the constituent series the trend and 
seasonal variation were removed (when it was deemed appropriate)
before the final indexes were computed.

The theory that underlies the use of the Harvard curves 
for forecasting is that changes in speculation will generally 
precede changes in general business and that the significance 
of these changes will be more clearly understood when the 
course of the money curve is noted. A sharp rise in speculation 
at a time when money rates are low and still falling would appear
to forecast better business conditions. On the other hand, a 
fall in speculation at a time when money rates are rising would 
appear to forecast a decline in general business. If coupled with 
a detailed crosscut analysis of the current business situation, 
these curves are found very helpful in predicting general business 
conditions.
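
The reading of the curves described in this paragraph may be put in
the form of a simple rule. The sketch below is an illustration only
and is not the procedure of the Harvard Economic Society: the text
does not say how sharp a movement in speculation must be, nor what
level of money rates counts as low, so the threshold
`low_rate_ceiling` and all figures used are hypothetical.

```python
def harvard_style_signal(speculation_change, money_rate,
                         money_rate_change, low_rate_ceiling):
    """Classify the outlook from the A (speculation) and C (money) curves.

    speculation_change : recent change in the speculation index
    money_rate         : current short-term money rate, per cent
    money_rate_change  : recent change in that rate
    low_rate_ceiling   : hypothetical level below which rates count as low
    """
    # Speculation rising while money is cheap and getting cheaper.
    if (speculation_change > 0 and money_rate <= low_rate_ceiling
            and money_rate_change < 0):
        return "improvement"
    # Speculation falling while money is tightening.
    if speculation_change < 0 and money_rate_change > 0:
        return "decline"
    # Mixed evidence: the rule is silent.
    return "no clear signal"

print(harvard_style_signal(8.0, 3.5, -0.25, low_rate_ceiling=4.0))
print(harvard_style_signal(-5.0, 6.0, 0.5, low_rate_ceiling=4.0))
print(harvard_style_signal(8.0, 6.0, 0.5, low_rate_ceiling=4.0))
```

A rule of this kind returns no signal in the mixed cases, which is
where the crosscut analysis mentioned above would have to carry the
forecast.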

The Harvard curves are but one set of curves that have been 
employed in this attempt to forecast general business conditions. 
Various combinations of curves have been used. A number 


1 FRICKEY, EDWIN, “Revision of the Index of General Business
Conditions,” Review of Economic Statistics, Vol. 14 (1932), pp. 80-87.
make use of capital issue by private corporations and capital
expenditures of the various government bodies. The idea 
behind the use of investment curves is that the volume of income, 
and hence business, is largely determined by the volume of 
investment.

As the result of a great amount of research work during the
past twenty or twenty-five years, mostly under the auspices
of the National Bureau of Economic Research or the National
Industrial Conference Board, but also by scholars in the United
States Department of Commerce, increasing attention has been
given to methods of measuring business conditions based upon 
quantity and distribution of national income. Instead of indexes 
of production, employment, volume of trade, and the like, these 
new indexes attempt to measure national income and its distribu- 
tion, consumer expenditures and producer expenditures, savings, 
capital formation, and the like. Figure 153 gives a picture of 
annual consumer spending, 1919-1942, showing indexes con- 
structed by Kuznets (National Bureau of Economic Research) 
and by the United States Department of Commerce.1 Figure
154 is another illustration of the use of national-income statistics 
and their derivatives to show the cycle in general business 
conditions. This figure reproduces an index of that part of the 
national income devoted to expenditures for new durable goods 
and indexes of gross capital formation, net capital formation, and 
offsets to savings. The United States Department of Commerce 
index of private gross capital expenditures is presumably equivalent
to Kuznets’s gross capital formation; to these are added
indexes by Lauchlin Currie reputed to measure income-producing
Federal expenditures that offset savings and net government 
contribution to savings. The index of expenditures for new 
durable goods is constructed by the Board of Governors of the 
Federal Reserve System. 

Time and experience will reveal whether or not the national-income
type of indexes proves to be better than the barometer
or over-all measure of business activity types. The national-income
type has been made possible by the increasing amount of


1 HOFFENBERG, MARVIN, and MABEL S. LEWIS, “Estimates of National
Output, Distributed Income, Consumer Spending, Saving, and Capital 
Formation," Review of Economic Statistics, Vol. 25 (May, 1943), pp. 107- 
174, 124. 
statistical data on income resulting from the administration of
the Federal personal and corporate income taxes. 

Perhaps the greatest difficulty with all forecasting series is 
that the amount of the lead or lag is likely to vary considerably 
from time to time, so that the timing of the forecasted change 
becomes difficult. Another difficulty is to judge how great a 
change in the forecasting series must be before it is considered 
significant. The curve is almost bound to show minor ups and 
downs that are little related to general business. Presumably a 
movement either up or down must be great and persistent before 
any significant change is forecasted, but how great and how 
persistent is the question. The answer to this question is always
easy to read ex post facto, but in following the forecaster from 
month to month this is more difficult. If a lead is short and 
data are not reported quickly, a given forecasting series, con- 
sistent and reliable as it may be, is unlikely to have much fore- 
casting value since the change would be under way before it 
was manifested by the forecasting data. 
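
One common way of measuring the lead of a forecasting series, offered
here only as an illustrative sketch and not as a method taken from
this text, is to correlate the series with general business at
several displacements and note the displacement giving the highest
correlation. The data below are hypothetical and are constructed so
that business repeats the barometer three months later.

```python
import math

def correlation(x, y):
    """Pearson coefficient of correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def best_lead(barometer, business, max_lead):
    """Return (lead, r) for the lead, in periods, at which barometer[t]
    correlates most strongly with business[t + lead]."""
    results = []
    for lead in range(max_lead + 1):
        x = barometer[: len(barometer) - lead]  # earlier barometer values
        y = business[lead:]                     # later business values
        results.append((lead, correlation(x, y)))
    return max(results, key=lambda pair: pair[1])

# Hypothetical monthly indexes covering five years.
barometer = [100 + 10 * math.sin(2 * math.pi * t / 24) for t in range(60)]
business = [100.0] * 3 + barometer[:-3]

lead, r = best_lead(barometer, business, max_lead=8)
print("estimated lead in months:", lead, "correlation:", round(r, 3))
```

Repeating the calculation on separate sub-periods would show whether
the lead is stable. The difficulty described above is precisely that
in actual series it often is not.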

These difficulties apply in differing degrees to the various 
kinds of forecasters. In the case of the barometer type, which 
is ordinarily dependent upon one presumably indicating series 
such as unfilled orders of the United States Steel Corporation, 
the data are usually promptly available; but the minor ups and 
downs and the varying degree of lead and lag in the barometer as 
compared with general business constitute ever-present diffi- 
culties in their use. The indexes of general business activity 
based upon combinations of several series are less affected by 
difficulties with respect to lead and lag and minor fluctuations, 
but it is often difficult to find a combination of series that are 
promptly reported. The national-income type of indexes suffer 
particularly from the fact that the data are not available cur- 
rently, except for estimates that are being attempted, and these 
are dependent upon other types of data.1

1 But see p. 674 for application of leads and lags to the forecasting of
specific lines of business activity, in which it can be more successfully
applied.

A unique type of forecasting by historical analogy is employed
by Roger Babson. The forecasting instrument is the Babson
index of the physical volume of production. This covers
manufacturing production, mineral production, agricultural
marketings, building and construction, railway freight, electric power,
and foreign trade. The long-run trend of this curve is taken
as normal, and the cyclical fluctuations are forecasted on the
mechanical principle that a given action has an equal and
opposite reaction.1 Thus the area of a given period of prosperity
will indicate the area, but not the shape, of the coming depression. 
The slope of the depression area is forecasted to some extent 
with the help of other series and contemporary crosscut analysis 
of individual lines of activity. 
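
The equal-area principle can be illustrated with a small calculation.
The sketch below is not Babson's actual computation, which is not
given here; it merely sums the deviations of a hypothetical index
from a hypothetical normal line of 100 and shows how much depression
area would remain to be expected on the equal-and-opposite
assumption.

```python
def area_about_normal(index, normal):
    """Sum of period-by-period deviations of the index from the normal line.

    A positive total is net area above normal; a negative total is
    net area below normal.
    """
    return sum(i - n for i, n in zip(index, normal))

# Hypothetical monthly index during a period of prosperity.
prosperity = [102, 105, 109, 112, 114, 113, 110, 106, 103, 101, 100, 100]
area_above = area_about_normal(prosperity, [100.0] * len(prosperity))

# Hypothetical first four months of the depression that follows.
depression_so_far = [98, 95, 92, 90]
area_below = -area_about_normal(depression_so_far,
                                [100.0] * len(depression_so_far))

# Area still to be expected below normal if the two areas are to be equal.
remaining = area_above - area_below
print(area_above, area_below, remaining)
```

The calculation fixes only the total area. As the text notes, the
distribution of that area over time must be judged from other
evidence.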

Forecasting by Crosscut Analysis. Even if considerable 
reliance is placed upon certain forecasting series based upon the 
historical-analogy principle, it would seem desirable to supple- 
ment the analysis by a more detailed study of the current 
situation. This will help to time the forecast better. It will 
also assure the forecaster that the forecasting series continue to 
hold their theoretical significance in the ebb and flow of business. 
The great danger is that the business-cycle theory on which the
forecasting series are based may become outmoded or may be too
simple to be fully satisfactory as new conditions unfold.
Crosscut analysis may possibly reveal these defects and help to
remedy them. 

Some believe that business cycles are unique and that the
roles played by various economic developments shift from cycle
to cycle. If this were true, crosscut analysis would be the only
logical method of forecasting. Some general theory would
necessarily have to underlie the forecast, even if it were the negative
theory that all cycles are unique. Nevertheless, it is necessary 
to examine all the important sectors of the economy, to weigh 
their relative importance in the given situation, and to determine 
the net outcome. This requires comprehensive surveys and 
shrewd judgment based on wide experience. 

Such agencies as the Brookmire Economic Service, Standard & 
Poor’s Corporation, and Moody’s Investor’s Service generally 
follow the crosscut method. The Brookmire Economic Service
watches carefully selected series, such as building construction, 
motorcar output and registration, exchange rates, and industrial 
employment. The importance attributed to the various series 
differs from time to time. Also, new ones are added and old ones 
discarded as the economic situation changes. In all cases where 

1 Seasonal variations in individual series are eliminated. 
warranted, the Brookmire Economic Service distinguishes carefully
between basic trends, seasonal variation, and business
cycles in its appraisal of the business outlook. The Standard & 
Poor’s Corporation also watches many lines of activity and 
forecasts development in each line. The forecast of general 
business is mainly a summary of these many individual forecasts. 
Moody’s Investor’s Service likewise bases its general forecast 
upon many individual analyses. In making its forecasts, how- 
ever, Moody’s appears to be especially influenced by business- 
men’s anticipations of profits, a factor that receives much empha- 
sis in modern business-cycle theory. 

Forecasting Particular Lines of Activity. The same methods 
are used to forecast particular lines of activity as for general 
business conditions. Crude historical analogy, the use of 
leading series, and crosscut analysis all play their roles. 

Crude Historical Analogy. Figure 155 is an excellent example 
of the use of crude historical analogy in forecasting the course of 
agricultural prices and of wage rates during a long and extensive 
world war. Here the course of agricultural prices and wages in 
the First World War is taken as the pattern for the expected 
course of agricultural prices and wages in the Second World War. 
From the proximity of the two series to each other until the 
beginning of 1943 it would seem that the forecasting power of the 
former series is relatively high. This method is of greater value 
in forecasting particular lines of activity than it is when applied 
to general business conditions, although crosscut analysis might 
modify judgment of this forecast by pointing out that the efforts 
at price control and inflation control in the Second World War 
appear to be more courageous than they were in the First 
World War. 

Lead-Lag Relationships. Figure 156 illustrates the lead-lag 
relationship in forecasting hog production. In raising hogs the 
principal cost is the corn on which the hogs are fed. Further- 
more, the ratio between the amount of corn fed to a hog and his 
weight is fairly constant. Hence, the profitability of hog 
raising is essentially indicated by the so-called “corn-hog dif- 
ferential," which is the difference between the price of 100 pounds 
of hogs and the cost of enough corn to raise 100 pounds of hogs. 
As this differential increases, hog production becomes more 
profitable; as it decreases, hog production becomes less profitable. 
The increase or decrease in profitability affects hog production
with several months’ lag. Hence, changes in the corn-hog differential
can be used to forecast changes in hog production, as
shown in the figure. 
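
The differential defined above is a matter of simple arithmetic,
which the following sketch makes explicit. The number of bushels of
corn assumed necessary to produce 100 pounds of hog, and all prices
shown, are hypothetical figures chosen for illustration and are not
taken from the figure or from any published series.

```python
def corn_hog_differential(hog_price_per_cwt, corn_price_per_bu,
                          bushels_per_cwt):
    """Price of 100 lb. of hogs less the cost of the corn needed to
    raise 100 lb. of hogs."""
    return hog_price_per_cwt - bushels_per_cwt * corn_price_per_bu

BUSHELS_PER_CWT = 11.0  # hypothetical feeding requirement

# Hypothetical prices in two successive months (dollars).
earlier = corn_hog_differential(13.00, 0.90, BUSHELS_PER_CWT)
later = corn_hog_differential(14.00, 0.90, BUSHELS_PER_CWT)

# A rising differential points to expanding production after the lag.
direction = "increase" if later > earlier else "decrease or no change"
print(round(earlier, 2), round(later, 2), direction)
```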
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€ BASED ON DATA FROM FEDERAL RESERVE BANK OF NEW YORK. HOURLY EARNINGS OF INDUSTRIAL LABOR, 
WAQES OF FARM LABOR AND BUILDING TRADES, AND SALARIES OP TEACHERS AND GLERICAL EMPLOYEES 


A ADJUSTED FOR SEASONAL VARIATION 
Fra. 155.—Prices received by farmers and composite wage rates. Index num- 
bers, United States, 1914-1920, and 1939-1943. [From The Agricultural 
Situation, (May, 1943), p. 8, published by the Bureau of Agricultural Economics, 
United States Department of Agriculture.) 


The Cycle Hypothesis. The lag of hog production behind the
corn-hog differential not only permits forecasting of the former
but also tends to cause periodic upswings and downswings in the
two series. The reason for this is as follows: Suppose that the
demand for pork increases and, owing to the inability to increase
rapidly the production of hogs, the corn-hog differential rises.
This makes hogs more profitable to produce, and their number is
gradually increased. The lag in response, however, may cause
the differential to go higher than it would otherwise, and this in
turn might stimulate a greater increase in production than is
required to meet the new demand that caused the original rise
in the ratio. When this enlarged production comes on the
market, prices fall and the corn-hog differential drops. Owing to
the greatly increased supply, prices go lower than their natural
level and hog production becomes less profitable for a while.
The change in profitability causes hog production to drop off,
and ultimately prices tend to rise again, completing the cycle.
This existence of a periodic movement in the corn-hog differential
and in hog production permits forecasting for some
distance into the future. If a great war does not interrupt the
normal course of economic forces, the high and low periods in
the corn-hog differential can be predicted with a fair degree of
accuracy. Wise hog farmers gain considerably from this long-range
forecasting. Similar periodic movements tend to appear
in other lines of agriculture in which production lags behind
price stimuli. For example, the cattle cycle runs about fifteen
years, according to studies made by the United States Department
of Agriculture.

Fig. 156. Hog-corn price ratio and hog marketings, 1901-1942. (From Bureau
of Agricultural Economics, United States Department of Agriculture.)
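
The mechanism of the cycle hypothesis can be imitated by a very small
model in which supply responds to the price of the preceding period.
The sketch below is an illustrative assumption of ours, of the kind
often called a cobweb model; the straight-line demand and supply
schedules and every coefficient in it are hypothetical and are not
estimated from hog or corn data.

```python
def simulate_lagged_supply(periods, demand_intercept, demand_slope,
                           supply_intercept, supply_slope, initial_supply):
    """Trace supply and price when producers respond with a one-period lag.

    Price in each period clears the market for the supply on hand:
        price = demand_intercept - demand_slope * supply
    Next period's supply is set by this period's price:
        supply = supply_intercept + supply_slope * price
    """
    supply = initial_supply
    path = []
    for _ in range(periods):
        price = demand_intercept - demand_slope * supply
        path.append((supply, price))
        supply = supply_intercept + supply_slope * price
    return path

# Hypothetical schedules whose equilibrium is price 10, supply 10.
path = simulate_lagged_supply(8, 20.0, 1.0, 2.0, 0.8, initial_supply=6.0)
prices = [price for _, price in path]
for period, (supply, price) in enumerate(path):
    print(period, round(supply, 2), round(price, 2))
```

With these coefficients the price swings alternately above and below
its equilibrium, each swing smaller than the last. Other coefficients
would give swings that persist or grow, so the regularity of an
actual cycle cannot be taken for granted.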

Crosscut Analysis. The application of crosscut analysis to 
particular lines of activity is based in many instances on the 
analysis of supply and demand conditions. In agriculture, the 
carry-over and current crop prospects are important factors 
on the supply side. The economic condition of industries or of 
sections of the population using the given product, the prices of 
competing products, and the output of competing areas are 
important factors on the demand side. If the product has 
widespread uses, possibly prediction of changes in consumer 
incomes or in general industrial activity might be the best way of 
forecasting the future demand for it. 

In manufacturing, principal attention is likely to be devoted 
to demand. When the demand is industrial, the forecasting 
takes primarily the form of predicting conditions in those lines 
of activity immediately supplied by the given line of manufacturing.
Thus steel production might be forecasted from railroad
construction and maintenance, automobile production, road 
construction, and building activity. When the product is one 
sold to the consuming public and not to other industries, the 
analysis of demand becomes largely a study of the flow of income 
to consuming areas. This will be dependent on the prosperity
of important industries in these areas and on the net flow of 
incomes from outside sources. The prices of competing products 
will also be an important demand factor.

A statistical technique using multiple and partial correlation, 
mathematical and graphic methods, has been developed for 
making crosscut analyses such as those suggested in the two
preceding paragraphs. This technique is widely used; in the case
of many products the multiple- and partial-correlation technique 
makes it possible to derive demand curves that will forecast with 
considerable accuracy “the amount of change in sales that would 
accompany a given contemplated change in price.”1
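
The core of such a technique is a multiple regression of sales on
price and on other demand factors. The following is a minimal sketch
under stated assumptions and does not reproduce the procedure of any
particular author: demand is assumed to be linear in logarithms, only
two factors (price and consumer income) are used, and the data are
hypothetical figures constructed to follow such a relation exactly.

```python
import math

def fit_two_variable_regression(y, x1, x2):
    """Least-squares fit of y = c + b1*x1 + b2*x2 from deviations about means."""
    n = len(y)
    my, m1, m2 = sum(y) / n, sum(x1) / n, sum(x2) / n
    dy = [v - my for v in y]
    d1 = [v - m1 for v in x1]
    d2 = [v - m2 for v in x2]
    s11 = sum(a * a for a in d1)
    s22 = sum(a * a for a in d2)
    s12 = sum(a * b for a, b in zip(d1, d2))
    s1y = sum(a * b for a, b in zip(d1, dy))
    s2y = sum(a * b for a, b in zip(d2, dy))
    det = s11 * s22 - s12 * s12
    b1 = (s1y * s22 - s2y * s12) / det  # net effect of x1, x2 held constant
    b2 = (s2y * s11 - s1y * s12) / det  # net effect of x2, x1 held constant
    c = my - b1 * m1 - b2 * m2
    return c, b1, b2

# Hypothetical annual observations.
prices = [1.0, 1.2, 0.9, 1.1, 1.3, 0.8]
incomes = [100.0, 105.0, 110.0, 120.0, 125.0, 130.0]
sales = [math.exp(2.0) * p ** -1.5 * y ** 0.8 for p, y in zip(prices, incomes)]

c, b1, b2 = fit_two_variable_regression(
    [math.log(q) for q in sales],
    [math.log(p) for p in prices],
    [math.log(y) for y in incomes],
)

# Proportionate change in sales expected from a 10 per cent price cut,
# income unchanged.
sales_ratio = 0.9 ** b1
print(round(b1, 3), round(b2, 3), round(sales_ratio, 3))
```

The coefficient on price answers the question quoted above: it gives
the change in sales accompanying a contemplated change in price when
the other factors are held constant.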


FORECASTS WITH SEASONAL VARIATION 


Forecasting with seasonal variation is probably the oldest of
all types of modern forecasting and is so general as to be
commonplace. It is applied to particular lines of activity more
specifically than to general business conditions.

1 Cf. SCHULTZ, HENRY, The Theory and Measurement of Demand (1938).

Historical Analogy. Use of historical analogy for forecasting
with seasonal variation is simpler and more dependable than the
use of historical analogy for cyclical or trend forecasts. The
conditions underlying persistent seasonal variations are more
readily analyzed than are the rational explanations of cycles and
trends. Moreover, the forecasting is for a shorter period into
the future and can therefore depend upon conditions remaining
approximately unchanged pending the outcome of the forecasted
events. Statistical techniques have been developed for measuring
the dependability of a given seasonal variation.1

1 See pp. 631-636.

Fig. 157. Mortality from all causes. Metropolitan Life Insurance Company
industrial department, weekly premium-paying business. [From the Statistical
Bulletin, Vol. 24 (July, 1943), p. 12, published by the Metropolitan Life
Insurance Company.]

Figure 157 illustrates the extrapolation of seasonal variation,
which is the use of historical analogy for making a forecast with
seasonal variation. From the figure it can be forecast, by
assuming a continued agreement between 1942 and coming 1943
seasonal movement in mortality from all causes, that the September
death rate per 1,000, annual basis, will be about 7.25,
the October and November rate about 8.25, and the December
rate about 8.50.

Figure 158 is an application of the use of forecasting seasonal
variation by historical analogy to the field of agricultural
economics. Extrapolation of the 1943 curve predicts that income
from farm marketings in the South Central region of the United
States will fluctuate around 200 million dollars monthly until
July or August; thereafter, monthly cash income from farm
marketings in that region will rise sharply to a peak in October
of perhaps 500 million dollars or higher, since the 1943 level
appears to be a higher average than that of 1942. This figure
shows the annual average seasonal movement, 1937-1941, which
gives a somewhat more dependable seasonal indicator than a
single year’s figures.

Fig. 158. Cash income from farm marketings 1942-1943 compared with
1937-1941 average. Vertical axis: millions of dollars; horizontal axis:
months. [From The Agricultural Situation, Vol. 27 (June, 1943), p. 8,
published by the Bureau of Agricultural Economics, United States Department of
Agriculture.]

Combining Seasonal with Cyclical Forecasting. Whenever it
is desired to make forecasts on the basis of a period shorter than
a year, it is necessary to apply a seasonal forecast along with
cyclical forecasting. In the case of conventional forecasting
by the use of business-cycle studies and the resulting barometers,
general business indexes, and crosscut analysis, discussed in a
preceding section of this chapter, short-period forecasts based
upon known seasonal variations are used as well as the cyclical
forecasts.

Many illustrations could be found of the application of this 
combination of seasonal with cyclical forecasting. Figure 159 
is an illustration in the field of agricultural economics. Based 
upon statistical forecasting of the cycles in production of 
livestock, similar to the cycle analysis of hog production already 
outlined, the levels of livestock marketings for 1943 and 1944 
are forecast. The annual amount is then distributed throughout 
the months of the year according to the predetermined index of 
seasonal variation. The figure presents the resulting forecast 
of monthly transportation loads for livestock, estimated from 
indicated marketings and shipments from public markets in the 
United States. On the same figure are shown the actual amounts 
monthly for the years 1941 and 1942, for purposes of comparison. 
Figures similar to this one for various lines of industrial and 
manufacturing activity appear frequently in such publications 
as the Survey of Current Business and in the publications of the 
various forecasting agencies. 

Fig. 159.—Transportation loads for livestock, estimated on basis of indicated 
marketings and shipments from public markets, United States, January, 1941- 
March, 1944. [From The Agricultural Situation, Vol. 27 (February, 1943), p. 8, 
published by the Bureau of Agricultural Economics, United States Department 
of Agriculture.] 
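
The distribution of an annual forecast over the months by a seasonal index can be sketched in a few lines of Python. The index values and the annual total below are hypothetical and are not read from Fig. 159; the function name `distribute_annual` is likewise only illustrative.

```python
def distribute_annual(annual_total, seasonal_index):
    """Spread an annual forecast over 12 months in proportion to a seasonal index."""
    if len(seasonal_index) != 12:
        raise ValueError("expected 12 monthly index values")
    index_sum = sum(seasonal_index)
    # Each month receives the share of the year given by its index value.
    return [annual_total * s / index_sum for s in seasonal_index]


# Hypothetical seasonal index, January to December (average month = 100).
index = [95, 85, 90, 92, 96, 94, 90, 98, 108, 125, 120, 107]
annual_forecast = 1200.0  # hypothetical annual total, e.g. thousands of loads

monthly = distribute_annual(annual_forecast, index)
for month, amount in zip("JFMAMJJASOND", monthly):
    print(month, round(amount, 1))
```

Because the monthly shares are proportional to the index, the twelve monthly figures always add back to the annual forecast, whatever base the index is expressed on.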
THE QUALITY OF FORECASTING 


The success of forecasting is hard to judge. First it is to be 
noted that if an agency declines to make forecasts in difficult 
situations and makes rather limited forecasts in general, it is 
likely to have fewer failures than one that boldly undertakes to 
forecast on all occasions and in considerable detail. The success 
of a forecasting agency should be judged according to what it 
attempts to do. 

The success of forecasting should also be judged in the light 
of what might be accomplished by mere random guessing. In 
other words, an agency should be right at least 50 per cent of the 
time, or it is worse than useless. Judged on these bases, the 
various economic forecasting agencies have been fairly successful 
in forecasting general business conditions. Although not 
registering anything near a perfect score, they have at least 
been better than chance. 
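
Whether a record is "better than chance" can be checked with the binomial probabilities treated earlier in the text. The sketch below assumes that each forecast has only two possible outcomes (for example, business up or down) and that a pure guess is right half the time; the record of 14 correct calls in 20 is hypothetical.

```python
from math import comb


def prob_at_least(hits, trials, p=0.5):
    """Probability of `hits` or more successes in `trials` independent guesses."""
    return sum(
        comb(trials, k) * p**k * (1 - p) ** (trials - k)
        for k in range(hits, trials + 1)
    )


# Hypothetical record: 14 correct direction calls out of 20 forecasts.
p_value = prob_at_least(14, 20)
print(round(p_value, 4))  # chance of doing at least this well by guessing
```

For this hypothetical record the probability is a little under 6 per cent, so a guesser would seldom do as well; a record of 11 in 20, by contrast, would be quite consistent with guessing.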

One of the chief problems of economic forecasting lies in 
the effect of the forecast itself. The effect of the forecast may 
conceivably be such, on the one hand, that the forecast actually 
causes the forecasted event to occur, or, on the other hand, that 
the forecast actually prevents the forecasted event from occur- 
ring. Whether or not such untoward results are produced 
depends largely on how widely the forecast circulates. On the 
one hand, suppose a forecasting agency predicts a general infla- 
tion of prices and enough people become convinced that the 
forecast is a true one; in such a case, the forecast may not only be- 
come true but be itself the cause of the thing that is forecasted. 
On the other hand, a subscriber to a forecast expects to profit from 
its use, in that his plans will anticipate probable future conditions 
of which a competitor is supposedly not so well informed. The 
fewer who have this information, the more likely it is that they 
will profit and that the forecast will be a true one. But the 
wider the acceptance of the forecast, the less chance the indi- 
vidual subscriber has to gain and the less likely is it that the 
forecast will prove to be true. Suppose, for example, that a 
forecasting agency advises its clients in a given productive 
activity that the price of its product is going to rise as a result of 
some foreseen increase in demand; if too many of the producers 
obtain the forecaster's service and follow its advice, 
overproduction will result and the price will decline rather than 
rise. This is an illustration of how a forecast might defeat itself. 

In the final analysis, it may be said that the greatest value of 
modern forecasting work lies in the large amount of statistical 
economic analysis that it promotes. Research into the business 
cycle and continued improvements in the statistical approach to 
social and economic problems cannot fail to reveal closer and 
closer approximations to the truth and thereby improve general 
knowledge about economic and social conditions. 
1 Taken, with permission, from E. V. Huntington's Four Place Tables of Logarithms and 
Trigonometric Functions (Harvard Cooperative Society, Inc., 1907). 
1.05 | 0212 | 0216 | 0220 | 0224 | 0228 | 0233 | 0237 | 0241 | 0245 | 0249 | 0253 
TABLE II.—SQUARES OF NUMBERS¹ 

  N |      0 |      1 |      2 |      3 |      4 |      5 |      6 |      7 |      8 |      9 
100 |  10000 |  10201 |  10404 |  10609 |  10816 |  11025 |  11236 |  11449 |  11664 |  11881 
110 |  12100 |  12321 |  12544 |  12769 |  12996 |  13225 |  13456 |  13689 |  13924 |  14161 
120 |  14400 |  14641 |  14884 |  15129 |  15376 |  15625 |  15876 |  16129 |  16384 |  16641 
130 |  16900 |  17161 |  17424 |  17689 |  17956 |  18225 |  18496 |  18769 |  19044 |  19321 
140 |  19600 |  19881 |  20164 |  20449 |  20736 |  21025 |  21316 |  21609 |  21904 |  22201 
150 |  22500 |  22801 |  23104 |  23409 |  23716 |  24025 |  24336 |  24649 |  24964 |  25281 
160 |  25600 |  25921 |  26244 |  26569 |  26896 |  27225 |  27556 |  27889 |  28224 |  28561 
170 |  28900 |  29241 |  29584 |  29929 |  30276 |  30625 |  30976 |  31329 |  31684 |  32041 
180 |  32400 |  32761 |  33124 |  33489 |  33856 |  34225 |  34596 |  34969 |  35344 |  35721 
190 |  36100 |  36481 |  36864 |  37249 |  37636 |  38025 |  38416 |  38809 |  39204 |  39601 
200 |  40000 |  40401 |  40804 |  41209 |  41616 |  42025 |  42436 |  42849 |  43264 |  43681 
210 |  44100 |  44521 |  44944 |  45369 |  45796 |  46225 |  46656 |  47089 |  47524 |  47961 
220 |  48400 |  48841 |  49284 |  49729 |  50176 |  50625 |  51076 |  51529 |  51984 |  52441 
230 |  52900 |  53361 |  53824 |  54289 |  54756 |  55225 |  55696 |  56169 |  56644 |  57121 
240 |  57600 |  58081 |  58564 |  59049 |  59536 |  60025 |  60516 |  61009 |  61504 |  62001 
250 |  62500 |  63001 |  63504 |  64009 |  64516 |  65025 |  65536 |  66049 |  66564 |  67081 
260 |  67600 |  68121 |  68644 |  69169 |  69696 |  70225 |  70756 |  71289 |  71824 |  72361 
270 |  72900 |  73441 |  73984 |  74529 |  75076 |  75625 |  76176 |  76729 |  77284 |  77841 
280 |  78400 |  78961 |  79524 |  80089 |  80656 |  81225 |  81796 |  82369 |  82944 |  83521 
290 |  84100 |  84681 |  85264 |  85849 |  86436 |  87025 |  87616 |  88209 |  88804 |  89401 
300 |  90000 |  90601 |  91204 |  91809 |  92416 |  93025 |  93636 |  94249 |  94864 |  95481 
310 |  96100 |  96721 |  97344 |  97969 |  98596 |  99225 |  99856 | 100489 | 101124 | 101761 
320 | 102400 | 103041 | 103684 | 104329 | 104976 | 105625 | 106276 | 106929 | 107584 | 108241 
330 | 108900 | 109561 | 110224 | 110889 | 111556 | 112225 | 112896 | 113569 | 114244 | 114921 
340 | 115600 | 116281 | 116964 | 117649 | 118336 | 119025 | 119716 | 120409 | 121104 | 121801 
350 | 122500 | 123201 | 123904 | 124609 | 125316 | 126025 | 126736 | 127449 | 128164 | 128881 
360 | 129600 | 130321 | 131044 | 131769 | 132496 | 133225 | 133956 | 134689 | 135424 | 136161 
370 | 136900 | 137641 | 138384 | 139129 | 139876 | 140625 | 141376 | 142129 | 142884 | 143641 
380 | 144400 | 145161 | 145924 | 146689 | 147456 | 148225 | 148996 | 149769 | 150544 | 151321 
390 | 152100 | 152881 | 153664 | 154449 | 155236 | 156025 | 156816 | 157609 | 158404 | 159201 
400 | 160000 | 160801 | 161604 | 162409 | 163216 | 164025 | 164836 | 165649 | 166464 | 167281 
410 | 168100 | 168921 | 169744 | 170569 | 171396 | 172225 | 173056 | 173889 | 174724 | 175561 
420 | 176400 | 177241 | 178084 | 178929 | 179776 | 180625 | 181476 | 182329 | 183184 | 184041 
430 | 184900 | 185761 | 186624 | 187489 | 188356 | 189225 | 190096 | 190969 | 191844 | 192721 
440 | 193600 | 194481 | 195364 | 196249 | 197136 | 198025 | 198916 | 199809 | 200704 | 201601 
450 | 202500 | 203401 | 204304 | 205209 | 206116 | 207025 | 207936 | 208849 | 209764 | 210681 
460 | 211600 | 212521 | 213444 | 214369 | 215296 | 216225 | 217156 | 218089 | 219024 | 219961 
470 | 220900 | 221841 | 222784 | 223729 | 224676 | 225625 | 226576 | 227529 | 228484 | 229441 
480 | 230400 | 231361 | 232324 | 233289 | 234256 | 235225 | 236196 | 237169 | 238144 | 239121 
490 | 240100 | 241081 | 242064 | 243049 | 244036 | 245025 | 246016 | 247009 | 248004 | 249001 
500 | 250000 | 251001 | 252004 | 253009 | 254016 | 255025 | 256036 | 257049 | 258064 | 259081 
510 | 260100 | 261121 | 262144 | 263169 | 264196 | 265225 | 266256 | 267289 | 268324 | 269361 
520 | 270400 | 271441 | 272484 | 273529 | 274576 | 275625 | 276676 | 277729 | 278784 | 279841 
530 | 280900 | 281961 | 283024 | 284089 | 285156 | 286225 | 287296 | 288369 | 289444 | 290521 
540 | 291600 | 292681 | 293764 | 294849 | 295936 | 297025 | 298116 | 299209 | 300304 | 301401 

1 Source: WAUGH, ALBERT E., Laboratory Manual and Problems for Elements of Statistical 
Method (McGraw-Hill Book Company, Inc., 1944). 


TABLE II.—SQUARES OF NUMBERS.—(Continued) 

  N |      0 |      1 |      2 |      3 |      4 |      5 |      6 |      7 |      8 |      9 
550 | 302500 | 303601 | 304704 | 305809 | 306916 | 308025 | 309136 | 310249 | 311364 | 312481 
560 | 313600 | 314721 | 315844 | 316969 | 318096 | 319225 | 320356 | 321489 | 322624 | 323761 
570 | 324900 | 326041 | 327184 | 328329 | 329476 | 330625 | 331776 | 332929 | 334084 | 335241 
580 | 336400 | 337561 | 338724 | 339889 | 341056 | 342225 | 343396 | 344569 | 345744 | 346921 
590 | 348100 | 349281 | 350464 | 351649 | 352836 | 354025 | 355216 | 356409 | 357604 | 358801 
600 | 360000 | 361201 | 362404 | 363609 | 364816 | 366025 | 367236 | 368449 | 369664 | 370881 
610 | 372100 | 373321 | 374544 | 375769 | 376996 | 378225 | 379456 | 380689 | 381924 | 383161 
620 | 384400 | 385641 | 386884 | 388129 | 389376 | 390625 | 391876 | 393129 | 394384 | 395641 
630 | 396900 | 398161 | 399424 | 400689 | 401956 | 403225 | 404496 | 405769 | 407044 | 408321 
640 | 409600 | 410881 | 412164 | 413449 | 414736 | 416025 | 417316 | 418609 | 419904 | 421201 
650 | 422500 | 423801 | 425104 | 426409 | 427716 | 429025 | 430336 | 431649 | 432964 | 434281 
660 | 435600 | 436921 | 438244 | 439569 | 440896 | 442225 | 443556 | 444889 | 446224 | 447561 
670 | 448900 | 450241 | 451584 | 452929 | 454276 | 455625 | 456976 | 458329 | 459684 | 461041 
680 | 462400 | 463761 | 465124 | 466489 | 467856 | 469225 | 470596 | 471969 | 473344 | 474721 
690 | 476100 | 477481 | 478864 | 480249 | 481636 | 483025 | 484416 | 485809 | 487204 | 488601 
700 | 490000 | 491401 | 492804 | 494209 | 495616 | 497025 | 498436 | 499849 | 501264 | 502681 
710 | 504100 | 505521 | 506944 | 508369 | 509796 | 511225 | 512656 | 514089 | 515524 | 516961 
720 | 518400 | 519841 | 521284 | 522729 | 524176 | 525625 | 527076 | 528529 | 529984 | 531441 
730 | 532900 | 534361 | 535824 | 537289 | 538756 | 540225 | 541696 | 543169 | 544644 | 546121 
740 | 547600 | 549081 | 550564 | 552049 | 553536 | 555025 | 556516 | 558009 | 559504 | 561001 
750 | 562500 | 564001 | 565504 | 567009 | 568516 | 570025 | 571536 | 573049 | 574564 | 576081 
760 | 577600 | 579121 | 580644 | 582169 | 583696 | 585225 | 586756 | 588289 | 589824 | 591361 
770 | 592900 | 594441 | 595984 | 597529 | 599076 | 600625 | 602176 | 603729 | 605284 | 606841 
780 | 608400 | 609961 | 611524 | 613089 | 614656 | 616225 | 617796 | 619369 | 620944 | 622521 
790 | 624100 | 625681 | 627264 | 628849 | 630436 | 632025 | 633616 | 635209 | 636804 | 638401 
800 | 640000 | 641601 | 643204 | 644809 | 646416 | 648025 | 649636 | 651249 | 652864 | 654481 
810 | 656100 | 657721 | 659344 | 660969 | 662596 | 664225 | 665856 | 667489 | 669124 | 670761 
820 | 672400 | 674041 | 675684 | 677329 | 678976 | 680625 | 682276 | 683929 | 685584 | 687241 
830 | 688900 | 690561 | 692224 | 693889 | 695556 | 697225 | 698896 | 700569 | 702244 | 703921 
840 | 705600 | 707281 | 708964 | 710649 | 712336 | 714025 | 715716 | 717409 | 719104 | 720801 
850 | 722500 | 724201 | 725904 | 727609 | 729316 | 731025 | 732736 | 734449 | 736164 | 737881 
860 | 739600 | 741321 | 743044 | 744769 | 746496 | 748225 | 749956 | 751689 | 753424 | 755161 
870 | 756900 | 758641 | 760384 | 762129 | 763876 | 765625 | 767376 | 769129 | 770884 | 772641 
880 | 774400 | 776161 | 777924 | 779689 | 781456 | 783225 | 784996 | 786769 | 788544 | 790321 
890 | 792100 | 793881 | 795664 | 797449 | 799236 | 801025 | 802816 | 804609 | 806404 | 808201 
900 | 810000 | 811801 | 813604 | 815409 | 817216 | 819025 | 820836 | 822649 | 824464 | 826281 
910 | 828100 | 829921 | 831744 | 833569 | 835396 | 837225 | 839056 | 840889 | 842724 | 844561 
920 | 846400 | 848241 | 850084 | 851929 | 853776 | 855625 | 857476 | 859329 | 861184 | 863041 
930 | 864900 | 866761 | 868624 | 870489 | 872356 | 874225 | 876096 | 877969 | 879844 | 881721 
940 | 883600 | 885481 | 887364 | 889249 | 891136 | 893025 | 894916 | 896809 | 898704 | 900601 
950 | 902500 | 904401 | 906304 | 908209 | 910116 | 912025 | 913936 | 915849 | 917764 | 919681 
960 | 921600 | 923521 | 925444 | 927369 | 929296 | 931225 | 933156 | 935089 | 937024 | 938961 
970 | 940900 | 942841 | 944784 | 946729 | 948676 | 950625 | 952576 | 954529 | 956484 | 958441 
980 | 960400 | 962361 | 964324 | 966289 | 968256 | 970225 | 972196 | 974169 | 976144 | 978121 
990 | 980100 | 982081 | 984064 | 986049 | 988036 | 990025 | 992016 | 994009 | 996004 | 998001 
TABLE III.—SQUARE ROOTS OF NUMBERS FROM 10 TO 100¹ 

 N |  0.0  |  0.1  |  0.2  |  0.3  |  0.4  |  0.5  |  0.6  |  0.7  |  0.8  |  0.9 
10 | 3.162 | 3.178 | 3.194 | 3.209 | 3.225 | 3.240 | 3.256 | 3.271 | 3.286 | 3.302 
11 | 3.317 | 3.332 | 3.347 | 3.362 | 3.376 | 3.391 | 3.406 | 3.421 | 3.435 | 3.450 
12 | 3.464 | 3.479 | 3.493 | 3.507 | 3.521 | 3.536 | 3.550 | 3.564 | 3.578 | 3.592 
13 | 3.606 | 3.619 | 3.633 | 3.647 | 3.661 | 3.674 | 3.688 | 3.701 | 3.715 | 3.728 
14 | 3.742 | 3.755 | 3.768 | 3.782 | 3.795 | 3.808 | 3.821 | 3.834 | 3.847 | 3.860 
15 | 3.873 | 3.886 | 3.899 | 3.912 | 3.924 | 3.937 | 3.950 | 3.962 | 3.975 | 3.987 
16 | 4.000 | 4.012 | 4.025 | 4.037 | 4.050 | 4.062 | 4.074 | 4.087 | 4.099 | 4.111 
17 | 4.123 | 4.135 | 4.147 | 4.159 | 4.171 | 4.183 | 4.195 | 4.207 | 4.219 | 4.231 
18 | 4.243 | 4.254 | 4.266 | 4.278 | 4.290 | 4.301 | 4.313 | 4.324 | 4.336 | 4.347 
19 | 4.359 | 4.370 | 4.382 | 4.393 | 4.405 | 4.416 | 4.427 | 4.438 | 4.450 | 4.461 
20 | 4.472 | 4.483 | 4.494 | 4.506 | 4.517 | 4.528 | 4.539 | 4.550 | 4.561 | 4.572 
21 | 4.583 | 4.593 | 4.604 | 4.615 | 4.626 | 4.637 | 4.648 | 4.658 | 4.669 | 4.680 
22 | 4.690 | 4.701 | 4.712 | 4.722 | 4.733 | 4.743 | 4.754 | 4.764 | 4.775 | 4.785 
23 | 4.796 | 4.806 | 4.817 | 4.827 | 4.837 | 4.848 | 4.858 | 4.868 | 4.879 | 4.889 
24 | 4.899 | 4.909 | 4.919 | 4.930 | 4.940 | 4.950 | 4.960 | 4.970 | 4.980 | 4.990 
25 | 5.000 | 5.010 | 5.020 | 5.030 | 5.040 | 5.050 | 5.060 | 5.070 | 5.079 | 5.089 
26 | 5.099 | 5.109 | 5.119 | 5.128 | 5.138 | 5.148 | 5.158 | 5.167 | 5.177 | 5.187 
27 | 5.196 | 5.206 | 5.215 | 5.225 | 5.234 | 5.244 | 5.254 | 5.263 | 5.273 | 5.282 
28 | 5.292 | 5.301 | 5.310 | 5.320 | 5.329 | 5.339 | 5.348 | 5.357 | 5.367 | 5.376 
29 | 5.385 | 5.394 | 5.404 | 5.413 | 5.422 | 5.431 | 5.441 | 5.450 | 5.459 | 5.468 
30 | 5.477 | 5.486 | 5.495 | 5.505 | 5.514 | 5.523 | 5.532 | 5.541 | 5.550 | 5.559 
31 | 5.568 | 5.577 | 5.586 | 5.595 | 5.604 | 5.612 | 5.621 | 5.630 | 5.639 | 5.648 
32 | 5.657 | 5.666 | 5.674 | 5.683 | 5.692 | 5.701 | 5.710 | 5.718 | 5.727 | 5.736 
33 | 5.745 | 5.753 | 5.762 | 5.771 | 5.779 | 5.788 | 5.797 | 5.805 | 5.814 | 5.822 
34 | 5.831 | 5.840 | 5.848 | 5.857 | 5.865 | 5.874 | 5.882 | 5.891 | 5.899 | 5.908 
35 | 5.916 | 5.925 | 5.933 | 5.941 | 5.950 | 5.958 | 5.967 | 5.975 | 5.983 | 5.992 
36 | 6.000 | 6.008 | 6.017 | 6.025 | 6.033 | 6.042 | 6.050 | 6.058 | 6.066 | 6.075 
37 | 6.083 | 6.091 | 6.099 | 6.107 | 6.116 | 6.124 | 6.132 | 6.140 | 6.148 | 6.156 
38 | 6.164 | 6.173 | 6.181 | 6.189 | 6.197 | 6.205 | 6.213 | 6.221 | 6.229 | 6.237 
39 | 6.245 | 6.253 | 6.261 | 6.269 | 6.277 | 6.285 | 6.293 | 6.301 | 6.309 | 6.317 
40 | 6.325 | 6.332 | 6.340 | 6.348 | 6.356 | 6.364 | 6.372 | 6.380 | 6.387 | 6.395 
41 | 6.403 | 6.411 | 6.419 | 6.427 | 6.434 | 6.442 | 6.450 | 6.458 | 6.465 | 6.473 
42 | 6.481 | 6.488 | 6.496 | 6.504 | 6.512 | 6.519 | 6.527 | 6.535 | 6.542 | 6.550 
43 | 6.557 | 6.565 | 6.573 | 6.580 | 6.588 | 6.595 | 6.603 | 6.611 | 6.618 | 6.626 
44 | 6.633 | 6.641 | 6.648 | 6.656 | 6.663 | 6.671 | 6.678 | 6.686 | 6.693 | 6.701 
45 | 6.708 | 6.716 | 6.723 | 6.731 | 6.738 | 6.745 | 6.753 | 6.760 | 6.768 | 6.775 
46 | 6.782 | 6.790 | 6.797 | 6.804 | 6.812 | 6.819 | 6.826 | 6.834 | 6.841 | 6.848 
47 | 6.856 | 6.863 | 6.870 | 6.878 | 6.885 | 6.892 | 6.899 | 6.907 | 6.914 | 6.921 
48 | 6.928 | 6.935 | 6.943 | 6.950 | 6.957 | 6.964 | 6.971 | 6.979 | 6.986 | 6.993 
49 | 7.000 | 7.007 | 7.014 | 7.021 | 7.029 | 7.036 | 7.043 | 7.050 | 7.057 | 7.064 
50 | 7.071 | 7.078 | 7.085 | 7.092 | 7.099 | 7.106 | 7.113 | 7.120 | 7.127 | 7.134 
51 | 7.141 | 7.148 | 7.155 | 7.162 | 7.169 | 7.176 | 7.183 | 7.190 | 7.197 | 7.204 
52 | 7.211 | 7.218 | 7.225 | 7.232 | 7.239 | 7.246 | 7.253 | 7.259 | 7.266 | 7.273 
53 | 7.280 | 7.287 | 7.294 | 7.301 | 7.308 | 7.314 | 7.321 | 7.328 | 7.335 | 7.342 
54 | 7.348 | 7.355 | 7.362 | 7.369 | 7.376 | 7.382 | 7.389 | 7.396 | 7.403 | 7.409 

1 Source: WAUGH, ALBERT E., Laboratory Manual and Problems for Elements of Statistical 
Method (McGraw-Hill Book Company, Inc., 1944). 
TABLE III.—SQUARE ROOTS OF NUMBERS FROM 10 TO 100.—(Continued) 

 N |  0.0  | ... |  0.9 
55 | 7.416 | ... | 7.477 
56 | 7.483 | ... | 7.543 
57 | 7.550 | ... | 7.609 
58 | 7.616 | ... | 7.675 
59 | 7.681 | ... | 7.740 
60 | 7.746 | ... | 7.804 
61 | 7.810 | ... | 7.868 
62 | 7.874 | ... | 7.931 
63 | 7.937 | ... | 7.994 
64 | 8.000 | ... | 8.056 
65 | 8.062 | ... | 8.118 
66 | 8.124 | ... | 8.179 
67 | 8.185 | ... | 8.240 
68 | 8.246 | ... | 8.301 
69 | 8.307 | ... | 8.361 
70 | 8.367 | ... | 8.420 
71 | 8.426 | ... | 8.479 
72 | 8.485 | ... | 8.538 
73 | 8.544 | ... | 8.597 
74 | 8.602 | ... | 8.654 
75 | 8.660 | ... | 8.712 
76 | 8.718 | ... | 8.769 
77 | 8.775 | ... | 8.826 
78 | 8.832 | ... | 8.883 
79 | 8.888 | ... | 8.939 
80 | 8.944 | ... | 8.994 
81 | 9.000 | ... | 9.050 
82 | 9.055 | ... | 9.105 
83 | 9.110 | ... | 9.160 
84 | 9.165 | ... | 9.214 
85 | 9.220 | ... | 9.268 
86 | 9.274 | ... | 9.322 
87 | 9.327 | ... | 9.376 
88 | 9.381 | ... | 9.429 
89 | 9.434 | ... | 9.482 
90 | 9.487 | ... | 9.534 
91 | 9.539 | ... | 9.586 
92 | 9.592 | ... | 9.638 
93 | 9.644 | ... | 9.690 
94 | 9.695 | ... | 9.742 
95 | 9.747 | ... | 9.793 
96 | 9.798 | ... | 9.844 
97 | 9.849 | ... | 9.894 
98 | 9.899 | ... | 9.945 
99 | 9.950 | ... | 9.995 
TABLE IV.—SQUARE ROOTS OF NUMBERS FROM 100 TO 1000¹ 


1 Source: WAUGH, ALBERT E., Laboratory Manual and Problems for Elements of Statistical 
Method (McGraw-Hill Book Company, Inc., 1944). 

TABLE IV.—SQUARE ROOTS OF NUMBERS FROM 100 TO 1000.—(Continued) 
TABLE V.—RECIPROCALS OF NUMBERS¹ 

   N |  .00   |  .01  |  .02  |  .03  |  .04  |  .05  |  .06  |  .07  |  .08  |  .09 
1.00 | 1.0000 | .9901 | .9804 | .9709 | .9615 | .9524 | .9434 | .9346 | .9259 | .9174 
1.10 |  .9091 | .9009 | .8929 | .8850 | .8772 | .8696 | .8621 | .8547 | .8475 | .8403 
1.20 |  .8333 | .8264 | .8197 | .8130 | .8065 | .8000 | .7937 | .7874 | .7812 | .7752 
1.30 |  .7692 | .7634 | .7576 | .7519 | .7463 | .7407 | .7353 | .7299 | .7246 | .7194 
1.40 |  .7143 | .7092 | .7042 | .6993 | .6944 | .6897 | .6849 | .6803 | .6757 | .6711 
1.50 |  .6667 | .6623 | .6579 | .6536 | .6494 | .6452 | .6410 | .6369 | .6329 | .6289 
1.60 |  .6250 | .6211 | .6173 | .6135 | .6098 | .6061 | .6024 | .5988 | .5952 | .5917 
1.70 |  .5882 | .5848 | .5814 | .5780 | .5747 | .5714 | .5682 | .5650 | .5618 | .5587 
1.80 |  .5556 | .5525 | .5495 | .5464 | .5435 | .5405 | .5376 | .5348 | .5319 | .5291 
1.90 |  .5263 | .5236 | .5208 | .5181 | .5155 | .5128 | .5102 | .5076 | .5051 | .5025 
2.00 |  .5000 | .4975 | .4950 | .4926 | .4902 | .4878 | .4854 | .4831 | .4808 | .4785 
2.10 |  .4762 | .4739 | .4717 | .4694 | .4673 | .4651 | .4630 | .4608 | .4587 | .4566 
2.20 |  .4545 | .4525 | .4504 | .4484 | .4464 | .4444 | .4425 | .4405 | .4386 | .4367 
2.30 |  .4348 | .4329 | .4310 | .4292 | .4274 | .4255 | .4237 | .4219 | .4202 | .4184 
2.40 |  .4167 | .4149 | .4132 | .4115 | .4098 | .4082 | .4065 | .4049 | .4032 | .4016 
2.50 |  .4000 | .3984 | .3968 | .3953 | .3937 | .3922 | .3906 | .3891 | .3876 | .3861 
2.60 |  .3846 | .3831 | .3817 | .3802 | .3788 | .3774 | .3759 | .3745 | .3731 | .3717 
2.70 |  .3704 | .3690 | .3676 | .3663 | .3650 | .3636 | .3623 | .3610 | .3597 | .3584 
2.80 |  .3571 | .3559 | .3546 | .3534 | .3521 | .3509 | .3496 | .3484 | .3472 | .3460 
2.90 |  .3448 | .3436 | .3425 | .3413 | .3401 | .3390 | .3378 | .3367 | .3356 | .3344 
3.00 |  .3333 | .3322 | .3311 | .3300 | .3289 | .3279 | .3268 | .3257 | .3247 | .3236 
3.10 |  .3226 | .3215 | .3205 | .3195 | .3185 | .3175 | .3165 | .3155 | .3145 | .3135 
3.20 |  .3125 | .3115 | .3106 | .3096 | .3086 | .3077 | .3067 | .3058 | .3049 | .3040 
3.30 |  .3030 | .3021 | .3012 | .3003 | .2994 | .2985 | .2976 | .2967 | .2959 | .2950 
3.40 |  .2941 | .2933 | .2924 | .2915 | .2907 | .2899 | .2890 | .2882 | .2874 | .2865 
3.50 |  .2857 | .2849 | .2841 | .2833 | .2825 | .2817 | .2809 | .2801 | .2793 | .2786 
3.60 |  .2778 | .2770 | .2762 | .2755 | .2747 | .2740 | .2732 | .2725 | .2717 | .2710 
3.70 |  .2703 | .2695 | .2688 | .2681 | .2674 | .2667 | .2660 | .2653 | .2646 | .2639 
3.80 |  .2632 | .2625 | .2618 | .2611 | .2604 | .2597 | .2591 | .2584 | .2577 | .2571 
3.90 |  .2564 | .2558 | .2551 | .2545 | .2538 | .2532 | .2525 | .2519 | .2513 | .2506 
4.00 |  .2500 | .2494 | .2488 | .2481 | .2475 | .2469 | .2463 | .2457 | .2451 | .2445 
4.10 |  .2439 | .2433 | .2427 | .2421 | .2415 | .2410 | .2404 | .2398 | .2392 | .2387 
4.20 |  .2381 | .2375 | .2370 | .2364 | .2358 | .2353 | .2347 | .2342 | .2336 | .2331 
4.30 |  .2326 | .2320 | .2315 | .2309 | .2304 | .2299 | .2294 | .2288 | .2283 | .2278 
4.40 |  .2273 | .2268 | .2262 | .2257 | .2252 | .2247 | .2242 | .2237 | .2232 | .2227 
4.50 |  .2222 | .2217 | .2212 | .2208 | .2203 | .2198 | .2193 | .2188 | .2183 | .2179 
4.60 |  .2174 | .2169 | .2164 | .2160 | .2155 | .2151 | .2146 | .2141 | .2137 | .2132 
4.70 |  .2128 | .2123 | .2119 | .2114 | .2110 | .2105 | .2101 | .2096 | .2092 | .2088 
4.80 |  .2083 | .2079 | .2075 | .2070 | .2066 | .2062 | .2058 | .2053 | .2049 | .2045 
4.90 |  .2041 | .2037 | .2033 | .2028 | .2024 | .2020 | .2016 | .2012 | .2008 | .2004 
5.00 |  .2000 | .1996 | .1992 | .1988 | .1984 | .1980 | .1976 | .1972 | .1968 | .1965 
5.10 |  .1961 | .1957 | .1953 | .1949 | .1946 | .1942 | .1938 | .1934 | .1930 | .1927 
5.20 |  .1923 | .1919 | .1916 | .1912 | .1908 | .1905 | .1901 | .1898 | .1894 | .1890 
5.30 |  .1887 | .1883 | .1880 | .1876 | .1873 | .1869 | .1866 | .1862 | .1859 | .1855 
5.40 |  .1852 | .1848 | .1845 | .1842 | .1838 | .1835 | .1832 | .1828 | .1825 | .1821 

1 Source: WAUGH, ALBERT E., Laboratory Manual and Problems for Elements of Statistical 
Method (McGraw-Hill Book Company, Inc., 1944). 
TABLE V.—RECIPROCALS OF NUMBERS.—(Continued) 

   N |  .00  |  .01  |  .02  |  .03  |  .04  |  .05  |  .06  |  .07  |  .08  |  .09 
5.50 | .1818 | .1815 | .1812 | .1808 | .1805 | .1802 | .1799 | .1795 | .1792 | .1789 
5.60 | .1786 | .1783 | .1779 | .1776 | .1773 | .1770 | .1767 | .1764 | .1761 | .1757 
5.70 | .1754 | .1751 | .1748 | .1745 | .1742 | .1739 | .1736 | .1733 | .1730 | .1727 
5.80 | .1724 | .1721 | .1718 | .1715 | .1712 | .1709 | .1706 | .1704 | .1701 | .1698 
5.90 | .1695 | .1692 | .1689 | .1686 | .1684 | .1681 | .1678 | .1675 | .1672 | .1669 
6.00 | .1667 | .1664 | .1661 | .1658 | .1656 | .1653 | .1650 | .1647 | .1645 | .1642 
6.10 | .1639 | .1637 | .1634 | .1631 | .1629 | .1626 | .1623 | .1621 | .1618 | .1616 
6.20 | .1613 | .1610 | .1608 | .1605 | .1603 | .1600 | .1597 | .1595 | .1592 | .1590 
6.30 | .1587 | .1585 | .1582 | .1580 | .1577 | .1575 | .1572 | .1570 | .1567 | .1565 
6.40 | .1562 | .1560 | .1558 | .1555 | .1553 | .1550 | .1548 | .1546 | .1543 | .1541 
6.50 | .1538 | .1536 | .1534 | .1531 | .1529 | .1527 | .1524 | .1522 | .1520 | .1517 
6.60 | .1515 | .1513 | .1511 | .1508 | .1506 | .1504 | .1502 | .1499 | .1497 | .1495 
6.70 | .1493 | .1490 | .1488 | .1486 | .1484 | .1481 | .1479 | .1477 | .1475 | .1473 
6.80 | .1471 | .1468 | .1466 | .1464 | .1462 | .1460 | .1458 | .1456 | .1453 | .1451 
6.90 | .1449 | .1447 | .1445 | .1443 | .1441 | .1439 | .1437 | .1435 | .1433 | .1431 
7.00 | .1429 | .1427 | .1424 | .1422 | .1420 | .1418 | .1416 | .1414 | .1412 | .1410 
7.10 | .1408 | .1406 | .1404 | .1403 | .1401 | .1399 | .1397 | .1395 | .1393 | .1391 
7.20 | .1389 | .1387 | .1385 | .1383 | .1381 | .1379 | .1377 | .1376 | .1374 | .1372 
7.30 | .1370 | .1368 | .1366 | .1364 | .1362 | .1361 | .1359 | .1357 | .1355 | .1353 
7.40 | .1351 | .1350 | .1348 | .1346 | .1344 | .1342 | .1340 | .1339 | .1337 | .1335 
7.50 | .1333 | .1332 | .1330 | .1328 | .1326 | .1324 | .1323 | .1321 | .1319 | .1318 
7.60 | .1316 | .1314 | .1312 | .1311 | .1309 | .1307 | .1305 | .1304 | .1302 | .1300 
7.70 | .1299 | .1297 | .1295 | .1294 | .1292 | .1290 | .1289 | .1287 | .1285 | .1284 
7.80 | .1282 | .1280 | .1279 | .1277 | .1276 | .1274 | .1272 | .1271 | .1269 | .1267 
7.90 | .1266 | .1264 | .1263 | .1261 | .1259 | .1258 | .1256 | .1255 | .1253 | .1252 
8.00 | .1250 | .1248 | .1247 | .1245 | .1244 | .1242 | .1241 | .1239 | .1238 | .1236 
8.10 | .1235 | .1233 | .1232 | .1230 | .1228 | .1227 | .1225 | .1224 | .1222 | .1221 
8.20 | .1220 | .1218 | .1217 | .1215 | .1214 | .1212 | .1211 | .1209 | .1208 | .1206 
8.30 | .1205 | .1203 | .1202 | .1200 | .1199 | .1198 | .1196 | .1195 | .1193 | .1192 
8.40 | .1190 | .1189 | .1188 | .1186 | .1185 | .1183 | .1182 | .1181 | .1179 | .1178 
8.50 | .1176 | .1175 | .1174 | .1172 | .1171 | .1170 | .1168 | .1167 | .1166 | .1164 
8.60 | .1163 | .1161 | .1160 | .1159 | .1157 | .1156 | .1155 | .1153 | .1152 | .1151 
8.70 | .1149 | .1148 | .1147 | .1145 | .1144 | .1143 | .1142 | .1140 | .1139 | .1138 
8.80 | .1136 | .1135 | .1134 | .1132 | .1131 | .1130 | .1129 | .1127 | .1126 | .1125 
8.90 | .1124 | .1122 | .1121 | .1120 | .1119 | .1117 | .1116 | .1115 | .1114 | .1112 
9.00 | .1111 | .1110 | .1109 | .1107 | .1106 | .1105 | .1104 | .1103 | .1101 | .1100 
9.10 | .1099 | .1098 | .1096 | .1095 | .1094 | .1093 | .1092 | .1091 | .1089 | .1088 
9.20 | .1087 | .1086 | .1085 | .1083 | .1082 | .1081 | .1080 | .1079 | .1078 | .1076 
9.30 | .1075 | .1074 | .1073 | .1072 | .1071 | .1070 | .1068 | .1067 | .1066 | .1065 
9.40 | .1064 | .1063 | .1062 | .1060 | .1059 | .1058 | .1057 | .1056 | .1055 | .1054 
9.50 | .1053 | .1052 | .1050 | .1049 | .1048 | .1047 | .1046 | .1045 | .1044 | .1043 
9.60 | .1042 | .1041 | .1040 | .1038 | .1037 | .1036 | .1035 | .1034 | .1033 | .1032 
9.70 | .1031 | .1030 | .1029 | .1028 | .1027 | .1026 | .1025 | .1024 | .1022 | .1021 
9.80 | .1020 | .1019 | .1018 | .1017 | .1016 | .1015 | .1014 | .1013 | .1012 | .1011 
9.90 | .1010 | .1009 | .1008 | .1007 | .1006 | .1005 | .1004 | .1003 | .1002 | .1001 
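
Any row of Tables II, III, and V can be regenerated, or a doubtful entry checked, with a short Python sketch such as the following. The function names are illustrative only, and the rounding is ordinary rounding to the number of places printed, so a printed figure may occasionally differ by one unit in the last place.

```python
from math import sqrt


def squares_row(n):
    """Row of Table II: squares of n, n + 1, ..., n + 9."""
    return [(n + j) ** 2 for j in range(10)]


def roots_row(n):
    """Row of Table III: square roots of n.0, n.1, ..., n.9 to three places."""
    return [round(sqrt(n + j / 10), 3) for j in range(10)]


def reciprocals_row(n):
    """Row of Table V: reciprocals of n, n + .01, ..., n + .09 to four places."""
    return [round(1 / (n + j / 100), 4) for j in range(10)]


print(squares_row(100))
print(roots_row(10))
print(reciprocals_row(1.0))
```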
TABLE VI.—AREAS UNDER THE NORMAL CURVE 

Fractional parts of the total area (1.000) under the normal curve between 
the mean and a perpendicular erected at various numbers of standard 
deviations (x/σ) from the mean.¹ To illustrate the use of the table, 39.065 
per cent of the total area under the curve will lie between the mean and a 
perpendicular erected at a distance of 1.23σ from the mean. 

Each figure in the body of the table is preceded by a decimal point. 

x/σ |  .00  | ... |  .04  |  .05  |  .06  |  .07  |  .08  |  .09 
0.0 | 00000 | ... | 01595 | 01994 | 02392 | 02790 | 03188 | 03586 
0.1 | 03983 | ... | 05567 | 05962 | 06356 | 06749 | 07142 | 07535 
0.2 | 07926 | ... | 09483 | 09871 | 10257 | 10642 | 11026 | 11409 
0.3 | 11791 | ... | 13307 | 13683 | 14058 | 14431 | 14803 | 15173 
0.4 | 15542 | ... | 17003 | 17364 | 17724 | 18082 | 18439 | 18793 
0.5 | 19146 | ... | 20540 | 20884 | 21226 | 21566 | 21904 | 22240 
0.6 | 22575 | ... | 23891 | 24215 | 24537 | 24857 | 25175 | 25490 
0.7 | 25804 | ... | 27035 | 27337 | 27637 | 27935 | 28230 | 28524 
0.8 | 28814 | ... | 29955 | 30234 | 30511 | 30785 | 31057 | 31327 
0.9 | 31594 | ... | 32639 | 32894 | 33147 | 33398 | 33646 | 33891 
1.0 | 34134 | ... | 35083 | 35313 | 35543 | 35769 | 35993 | 36214 
1.1 | 36433 | ... | 37286 | 37493 | 37698 | 37900 | 38100 | 38298 
1.2 | 38493 | ... | 39251 | 39435 | 39617 | 39796 | 39973 | 40147 
1.3 | 40320 | ... | 40988 | 41149 | 41308 | 41466 | 41621 | 41774 
1.4 | 41924 | ... | 42507 | 42647 | 42786 | 42922 | 43056 | 43189 
1.5 | 43319 | ... | 43822 | 43943 | 44062 | 44179 | 44295 | 44408 
1.6 | 44520 | ... | 44950 | 45053 | 45154 | 45254 | 45352 | 45449 
1.7 | 45543 | ... | 45907 | 45994 | 46080 | 46164 | 46246 | 46327 
1.8 | 46407 | ... | 46712 | 46784 | 46856 | 46926 | 46995 | 47062 
1.9 | 47128 | ... | 47381 | 47441 | 47500 | 47558 | 47615 | 47670 
2.0 | 47725 | ... | 47932 | 47982 | 48030 | 48077 | 48124 | 48169 
2.1 | 48214 | ... | 48382 | 48422 | 48461 | 48500 | 48537 | 48574 
2.2 | 48610 | ... | 48745 | 48778 | 48809 | 48840 | 48870 | 48899 
2.3 | 48928 | ... | 49036 | 49061 | 49086 | 49111 | 49134 | 49158 
2.4 | 49180 | ... | 49266 | 49286 | 49305 | 49324 | 49343 | 49361 
2.5 | 49379 | ... | 49446 | 49461 | 49477 | 49492 | 49506 | 49520 
2.6 | 49534 | ... | 49585 | 49598 | 49609 | 49621 | 49632 | 49643 
2.7 | 49653 | ... | 49693 | 49702 | 49711 | 49720 | 49728 | 49736 
2.8 | 49744 | ... | 49774 | 49781 | 49788 | 49795 | 49801 | 49807 
2.9 | 49813 | ... | 49836 | 49841 | 49846 | 49851 | 49856 | 49861 

1 This table has been adapted, by permission, from F. C. Kent, "Elements of Statistics" 
(McGraw-Hill Book Company, Inc., 1924). 
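
Entries of Table VI can be checked against the error function in Python's standard `math` module. The sketch below is only a check on the table, not the method by which it was computed; the function name `area_mean_to` is illustrative.

```python
from math import erf, sqrt


def area_mean_to(z):
    """Area under the standard normal curve between the mean and z standard deviations."""
    return 0.5 * erf(abs(z) / sqrt(2))


# The illustration given with the table: a perpendicular at 1.23 standard deviations.
print(f"{area_mean_to(1.23):.5f}")  # 0.39065
```

Doubling the result gives the area within the same distance on both sides of the mean.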
TABLE VII.—ORDINATES OF THE NORMAL CURVE 

Ordinates (heights) of the standard normal curve.¹ The height (y) at 
any distance (x) from the mean is 

$$ y = 0.39894\, e^{-x^{2}/2} $$ 

To make the curve fit a histogram in which the abscissa scale is measured 
in original x units instead of standard-deviation (x/σ) units, multiply these 
ordinates by Ni/σ where N is the number of cases, i the class interval, and 
σ the standard deviation. 

Each figure in the body of the table is preceded by a decimal point. 

x/σ |  .00  | ... |  .03  |  .04  |  .05  |  .06  |  .07  |  .08  |  .09 
0.0 | 39894 | ... | 39876 | 39862 | 39844 | 39822 | 39797 | 39767 | 39733 
0.1 | 39695 | ... | 39559 | 39505 | 39448 | 39387 | 39322 | 39253 | 39181 
0.2 | 39104 | ... | 38853 | 38762 | 38667 | 38568 | 38466 | 38361 | 38251 
0.3 | 38139 | ... | 37780 | 37654 | 37524 | 37391 | 37255 | 37115 | 36973 
0.4 | 36827 | ... | 36371 | 36213 | 36053 | 35889 | 35723 | 35553 | 35381 
0.5 | 35207 | ... | 34667 | 34482 | 34294 | 34105 | 33912 | 33718 | 33521 
0.6 | 33322 | ... | 32713 | 32506 | 32297 | 32086 | 31874 | 31659 | 31443 
0.7 | 31225 | ... | 30563 | 30339 | 30114 | 29887 | 29658 | 29430 | 29200 
0.8 | 28969 | ... | 28269 | 28034 | 27798 | 27562 | 27324 | 27086 | 26848 
0.9 | 26609 | ... | 25888 | 25647 | 25406 | 25164 | 24923 | 24681 | 24439 
1.0 | 24197 | ... | 23471 | 23230 | 22988 | 22747 | 22506 | 22265 | 22025 
1.1 | 21785 | ... | 21069 | 20831 | 20594 | 20357 | 20121 | 19886 | 19652 
1.2 | 19419 | ... | 18724 | 18494 | 18265 | 18037 | 17810 | 17585 | 17360 
1.3 | 17137 | ... | 16474 | 16256 | 16038 | 15822 | 15608 | 15395 | 15183 
1.4 | 14973 | ... | 14350 | 14146 | 13943 | 13742 | 13542 | 13344 | 13147 
1.5 | 12952 | ... | 12376 | 12188 | 12001 | 11816 | 11632 | 11450 | 11270 
1.6 | 11092 | ... | 10567 | 10396 | 10226 | 10059 | 09893 | 09728 | 09566 
1.7 | 09405 | ... | 08933 | 08780 | 08628 | 08478 | 08329 | 08183 | 08038 
1.8 | 07895 | ... | 07477 | 07341 | 07206 | 07074 | 06943 | 06814 | 06687 
1.9 | 06562 | ... | 06195 | 06077 | 05959 | 05844 | 05730 | 05618 | 05508 
2.0 | 05399 | ... | 05082 | 04980 | 04879 | 04780 | 04682 | 04586 | 04491 
2.1 | 04398 | ... | 04128 | 04041 | 03955 | 03871 | 03788 | 03706 | 03626 
2.2 | 03547 | ... | 03319 | 03246 | 03174 | 03103 | 03034 | 02965 | 02898 
2.3 | 02833 | ... | 02643 | 02582 | 02522 | 02463 | 02406 | 02349 | 02294 
2.4 | 02239 | ... | 02083 | 02033 | 01984 | 01936 | 01888 | 01842 | 01797 
2.5 | 01753 | ... | 01625 | 01585 | 01545 | 01506 | 01468 | 01431 | 01394 
2.6 | 01358 | ... | 01256 | 01223 | 01191 | 01160 | 01130 | 01100 | 01071 
2.7 | 01042 | ... | 00961 | 00935 | 00909 | 00885 | 00861 | 00837 | 00814 
2.8 | 00792 | ... | 00727 | 00707 | 00687 | 00668 | 00649 | 00631 | 00613 
2.9 | 00595 | ... | 00545 | 00530 | 00514 | 00499 | 00485 | 00470 | 00457 

3.0 | 00443 
3.5 | 0008727 
4.0 | 0001338 

1 This table adapted, by permission, from Kent, "Elements of Statistics." 

2 Ordinates may also be computed from the equation log y = 9.600910 - 10 - 0.217147 x² 
and for ordinates beyond 3σ it would be necessary to use log y = 9.6009100658 - 10 - 
0.2171472410 x². 
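
The ordinate formula and the histogram scaling factor Ni/σ described with Table VII can be sketched in Python as follows. The function names and the example figures (200 cases, class interval 5, mean 50, standard deviation 10) are hypothetical and serve only to show the arithmetic.

```python
from math import exp, pi, sqrt


def ordinate(x):
    """Height of the standard normal curve at x standard deviations from the mean."""
    return exp(-x * x / 2) / sqrt(2 * pi)  # 1/sqrt(2*pi) is about 0.39894


def fitted_height(value, mean, sd, n_cases, class_interval):
    """Height of the fitted curve over `value`, on the scale of histogram frequencies."""
    x = (value - mean) / sd                # convert to standard-deviation units
    return ordinate(x) * n_cases * class_interval / sd


print(round(ordinate(1.0), 5))                       # compare the Table VII entry for 1.0
print(round(fitted_height(50, 50, 10, 200, 5), 3))   # height over the mean of the histogram
```

The multiplier Ni/σ converts the unit-area curve into one whose area equals the total area of the histogram bars, N cases times a class interval of i.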
TABLE VIII.—HYPERBOLIC TANGENTS¹ 

   z | r = tanh z |    z | r = tanh z |    z | r = tanh z 
0.00 |  0.00000   | 0.55 |  0.50052   | 1.10 |  0.80050 
0.01 |  0.01000   | 0.56 |  0.50798   | 1.11 |  0.80406 
0.02 |  0.02000   | 0.57 |  0.51536   | 1.12 |  0.80757 
0.03 |  0.02999   | 0.58 |  0.52267   | 1.13 |  0.81102 
0.04 |  0.03998   | 0.59 |  0.52990   | 1.14 |  0.81441 
0.05 |  0.04996   | 0.60 |  0.53705   | 1.15 |  0.81775 
0.06 |  0.05993   | 0.61 |  0.54413   | 1.16 |  0.82104 
0.07 |  0.06989   | 0.62 |  0.55113   | 1.17 |  0.82427 
0.08 |  0.07983   | 0.63 |  0.55805   | 1.18 |  0.82745 
0.09 |  0.08976   | 0.64 |  0.56490   | 1.19 |  0.83058 
0.10 |  0.09967   | 0.65 |  0.57167   | 1.20 |  0.83365 
0.11 |  0.10956   | 0.66 |  0.57836   | 1.21 |  0.83668 
0.12 |  0.11943   | 0.67 |  0.58498   | 1.22 |  0.83965 
0.13 |  0.12927   | 0.68 |  0.59152   | 1.23 |  0.84258 
0.14 |  0.13909   | 0.69 |  0.59798   | 1.24 |  0.84546 
0.15 |  0.14889   | 0.70 |  0.60437   | 1.25 |  0.84828 
0.16 |  0.15865   | 0.71 |  0.61068   | 1.26 |  0.85106 
0.17 |  0.16838   | 0.72 |  0.61691   | 1.27 |  0.85380 
0.18 |  0.17808   | 0.73 |  0.62307   | 1.28 |  0.85648 
0.19 |  0.18775   | 0.74 |  0.62915   | 1.29 |  0.85913 
0.20 |  0.19738   | 0.75 |  0.63515   | 1.30 |  0.86172 
0.21 |  0.20697   | 0.76 |  0.64108   | 1.31 |  0.86428 
0.22 |  0.21652   | 0.77 |  0.64693   | 1.32 |  0.86678 
0.23 |  0.22603   | 0.78 |  0.65271   | 1.33 |  0.86925 
0.24 |  0.23550   | 0.79 |  0.65841   | 1.34 |  0.87167 
0.25 |  0.24492   | 0.80 |  0.66404   | 1.35 |  0.87405 
0.26 |  0.25430   | 0.81 |  0.66959   | 1.36 |  0.87639 
0.27 |  0.26362   | 0.82 |  0.67507   | 1.37 |  0.87869 
0.28 |  0.27291   | 0.83 |  0.68048   | 1.38 |  0.88095 
0.29 |  0.28213   | 0.84 |  0.68581   | 1.39 |  0.88317 
0.30 |  0.29131   | 0.85 |  0.69107   | 1.40 |  0.88535 
0.31 |  0.30044   | 0.86 |  0.69626   | 1.41 |  0.88749 
0.32 |  0.30951   | 0.87 |  0.70137   | 1.42 |  0.88960 
0.33 |  0.31852   | 0.88 |  0.70642   | 1.43 |  0.89167 
0.34 |  0.32748   | 0.89 |  0.71139   | 1.44 |  0.89370 
0.35 |  0.33638   | 0.90 |  0.71630   | 1.45 |  0.89569 
0.36 |  0.34521   | 0.91 |  0.72113   | 1.46 |  0.89765 
0.37 |  0.35399   | 0.92 |  0.72590   | 1.47 |  0.89958 
0.38 |  0.36271   | 0.93 |  0.73059   | 1.48 |  0.90147 
0.39 |  0.37136   | 0.94 |  0.73522   | 1.49 |  0.90332 
0.40 |  0.37995   | 0.95 |  0.73978   | 1.50 |  0.90515 
0.41 |  0.38847   | 0.96 |  0.74428   | 1.51 |  0.90694 
0.42 |  0.39693   | 0.97 |  0.74870   | 1.52 |  0.90870 
0.43 |  0.40532   | 0.98 |  0.75307   | 1.53 |  0.91042 
0.44 |  0.41364   | 0.99 |  0.75736   | 1.54 |  0.91212 
0.45 |  0.42190   | 1.00 |  0.76159   | 1.55 |  0.91379 
0.46 |  0.43008   | 1.01 |  0.76576   | 1.56 |  0.91542 
0.47 |  0.43820   | 1.02 |  0.76987   | 1.57 |  0.91703 
0.48 |  0.44624   | 1.03 |  0.77391   | 1.58 |  0.91860 
0.49 |  0.45422   | 1.04 |  0.77789   | 1.59 |  0.92015 
0.50 |  0.46212   | 1.05 |  0.78181   | 1.60 |  0.92167 
0.51 |  0.46995   | 1.06 |  0.78566   | 1.61 |  0.92316 
0.52 |  0.47770   | 1.07 |  0.78946   | 1.62 |  0.92462 
0.53 |  0.48538   | 1.08 |  0.79320   | 1.63 |  0.92606 
0.54 |  0.49299   | 1.09 |  0.79688   | 1.64 |  0.92747 

1 Source: HODGMAN, CHARLES D., Mathematical Tables from Handbook of Chemistry and 
Physics (1941). 
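TABLE VIII.—HYPERBOLIC TANGENTS.—(Continued) 

   z | r = tanh z |    z | r = tanh z |    z | r = tanh z 
1.65 |  0.92886   | 2.20 |  0.97574   | 2.75 |  0.99186 
1.66 |  0.93022   | 2.21 |  0.97622   | 2.76 |  0.99202 
1.67 |  0.93155   | 2.22 |            | 2.77 |  0.99218 
1.68 |  0.93286   | 2.23 |            | 2.78 |  0.99233 
1.69 |  0.93415   | 2.24 |            | 2.79 |  0.99248 
1.70 |  0.93541   | 2.25 |            | 2.80 |  0.99263 
1.71 |  0.93665   | 2.26 |            | 2.81 |  0.99278 
1.72 |  0.93786   | 2.27 |            | 2.82 |  0.99292 
1.73 |  0.93906   | 2.28 |            | 2.83 |  0.99306 
1.74 |  0.94023   | 2.29 |            | 2.84 |  0.99320 
1.75 |  0.94138   | 2.30 |            | 2.85 |  0.99333 
1.76 |  0.94250   | 2.31 |            | 2.86 |  0.99346 
1.77 |  0.94361   | 2.32 |            | 2.87 |  0.99359 
1.78 |  0.94470   | 2.33 |            | 2.88 |  0.99372 
1.79 |  0.94576   | 2.34 |            | 2.89 |  0.99384 
1.80 |  0.94681   | 2.35 |            | 2.90 |  0.99396 
1.81 |  0.94783   | 2.36 |            | 2.91 |  0.99408 
1.82 |  0.94884   | 2.37 |            | 2.92 |  0.99420 
1.83 |  0.94983   | 2.38 |            | 2.93 |  0.99431 
1.84 |  0.95080   | 2.39 |            | 2.94 |  0.99443 
1.85 |            | 2.40 |            | 2.95 |  0.99454 
1.86 |            | 2.41 |            | 2.96 |  0.99464 
1.87 |            | 2.42 |            | 2.97 |  0.99475 
1.88 |            | 2.43 |            | 2.98 |  0.99485 
1.89 |            | 2.44 |            | 2.99 |  0.99496 
1.90 |            | 2.45 |            | 3.0  |  0.99505 
1.91 |            | 2.46 |            | 3.1  |  0.99595 
1.92 |            | 2.47 |            | 3.2  |  0.99668 
1.93 |            | 2.48 |            | 3.3  |  0.99728 
1.94 |            | 2.49 |            | 3.4  |  0.99777 
1.95 |            | 2.50 |            | 3.5  |  0.99818 
1.96 |            | 2.51 |            | 3.6  |  0.99851 
1.97 |            | 2.52 |            | 3.7  |  0.99878 
1.98 |            | 2.53 |            | 3.8  |  0.99900 
1.99 |            | 2.54 |            | 3.9  |  0.99918 
2.00 |            | 2.55 |            | 4.0  |  0.99933 
2.01 |            | 2.56 |            | 4.1  |  0.99945 
2.02 |            | 2.57 |            | 4.2  |  0.99955 
2.03 |            | 2.58 |            | 4.3  |  0.99963 
2.04 |            | 2.59 |            | 4.4  |  0.99970 
2.05 |            | 2.60 |            | 4.5  |  0.99975 
2.06 |            | 2.61 |            | 4.6  |  0.99980 
2.07 |            | 2.62 |            | 4.7  |  0.99983 
2.08 |            | 2.63 |            | 4.8  |  0.99986 
2.09 |            | 2.64 |            | 4.9  |  0.99989 
2.10 |            | 2.65 |            | 5.0  |  0.99991 
2.11 |            | 2.66 |            |      | 
2.12 |            | 2.67 |            |      | 
2.13 |            | 2.68 |            |      | 
2.14 |            | 2.69 |            |      | 
2.15 |            | 2.70 |            |      | 
2.16 |            | 2.71 |            |      | 
2.17 |            | 2.72 |            |      | 
2.18 |            | 2.73 |            |      | 
2.19 |            | 2.74 |            |      | 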
A


Accuracy, in calculating statistics, 
230-231 
Agricultural Situation, 80 
Agricultural Statistics, 80 
U.S. Department of Agriculture, 
80 
Bureau of Agricultural Eco- 
nomics, 80-81 
Agricultural Situation, 80 
Agricultural Yearbook, 80 
Crops and Markets, 80 
American Bankers Association, 51 
Analysis of variance, in multiple cor-
relation, 422-429
in nonlinear correlation, correla- 
tion index, 395-396 
correlation ratio, 373-376 
in simple correlation, 352-353 
Annalist, 534 
Arithmetic charts, 129-131
Array, 139-140 
Asymmetry (see Skewness) 
Attributes, variable, 157 
Averages (see Frequency distribu- 
tions, averages) 
Avogadro's law, 57


B 
Banking statistics, sources of, 79- 
86 
Federal Reserve, Board of Gover- 
nors, 82 
Federal Reserve Bulletin, 82 
Member Bank Call Report, 82 
National Monetary Commission, 
83 
Statesman’s Yearbook, 86 


Banking statistics, U.S. Treasury 
Department, 79 
Abstract of Condition of National 
Banks, 79 
Bar charts, 104-105 
Bayes, T., 242 
Bernoulli, Daniel, 242 
Bernoulli, Jacques, 242 
Beta coefficient, 192-193 
Beta cross-product term, 425 
Biennial Census of Manufactures, 
537 
Binomial distribution, symmetrical 
(see Symmetrical binomial dis- 
tribution) 
Bivariate frequency distribution, 
325-353 
first-order standard deviation, 
relation to r, 351-353 
illustration of, 325-327 
table, 326 
joint variation illustrated (bivari-
ate scatter diagrams), 339,
343, 345
methods of summarization and 
comparison, 327-353 
Pearsonian coefficient of correla- 
tion, 338-349 
analysis of variance, 352-353 
calculation of, 347, 349 
progressions of means, 328-329 
illustrated (graphs), 328-329 
Bivariate frequency surface, 471-486
bivariate histogram, 469-471
illustrated (three-dimensional
diagram), 470
independent variables, 471
lines of regression, 486-488
mathematical representation,
487-488
nonnormal, 491-492 


Bivariate frequency surface, non- 
normal, product-moment for- 
mula for r, cases for use or 
nonuse, 491-492

normal, dependent variables, 477- 
486 
derivation of equation, 481-486,
492-496
equation of rotated ellipse, 481- 
482 
horizontal cross section, 481 
horizontal view (graph), 477 
illustrated (graph), 480 
mathematical representation, 
482-486 
rotation and narrowing with 
correlation, 483-484 
vertical view, 478 
normal, independent variables, 
472-476 
circular form with equal stand- 
ard deviations, 475-476 
illustrated, 472 
normal curve from which de- 
rived (graph), 473 
elliptical form with unequal 
standard deviations, 476 
horizontal section, 476 
illustrated (graph), 475 
mathematical representation, 
473-476 
Bivariate histogram, 469-471 
illustrated (three-dimensional dia- 
gram), 470 

Bivariate scatter diagram, 365 

Bivariate series, 149-154 

Boltzmann, L., 19 

Boscovich, R. G., 242 

Boyle’s law, 19, 652 

Bradstreet’s index, 522 

Bureau of Foreign and Domestic 
Commerce, 56, 76-77, 512, 535 

Bureau of Home Economics, 48 

Bureau of Labor Statistics, 7, 42-50, 
54, 500, 535, 538 

indexes, 517, 525, 527 
Bureau of Mines, 79 
Business barometers (see Indexes) 
C


Cartograms, 112-121 
by bars, 118-119 
by colors and shades, 121 
by cross-hatching, 116-117, 121 
by dots or points, 112-115, 117-
118, 120-121
Charles’ law, 57 
Charlier check, 209, 354-355 
Charts, 100-121 
arithmetic, 129-131 
bar, 104-105 
bivariate, 150-154 
component-bar, 106-107 
cross-hatched zone, 107-109 
of frequency distributions, 143- 
149 
frequency polygon, 143-147 
histogram, 147 
on a ratio scale, 147-149 
pictogram, 102-103 
ratio, 131-137 
logarithms in relation to, 133- 
137 
sectors of circles, 104, 109-112 
split-bar, 110-112 
of time series, 128-138 
time series in relatives, 130 
types of, 101 
Chi square (χ²) curve, 300
Chi square (χ²) test of goodness of
fit, 300-305
the χ² curve, 300
critical values for » ey 
(table), 304 
weaknesses of test, 305 
Class interval, 144, 164 
Classical concept of probability, 
242-247 
Coefficient, confidence, 311-312 
moment, 180 
of multiple correlation, 398, 416-
418
of partial correlation, 418-422 
of risk, 310-311 
Coefficient of correlation, arithmeti-
cal view of, 339-347
Charlier check, 354-355 
computation from grouped data, 
357-362 
short method, 357, 359-362 
tabulation of given data (table), 
356 
computation from ungrouped 
data, 354-357 
work sheet (table), 355 
distinguished from correlation 
ratio, 365 
first-order, 422 
order of, 422 
Pearsonian, 339 
relationship to line of regression, 
349-351 
second-order, 422 
third-order, 422 
zero-order, 422 
Combinations, 233-236 
binomial expansion in, 234-236 
defined and illustrated, 233-234 
Combinatorial analysis, problem in, 
270-283 
Commercial and Financial Chronicle, 
70 
Commercial statistics (see Sources
of statistical data, commercial) 
Commodity Yearbook, 71 
Component-bar charts, 106-107 
Confidence coefficient, 311-312 
Confidence interval, 313 
Consumers’ Incomes in the United 
States, 48 
Correlation, applications of, by 
social scientists, 324-325 
best way of studying, 365 
bivariate frequency table, 357 
coefficient of, Pearsonian, 339 
zero-order, first-order, second- 
order, etc., 422 
multiple (see Multiple correlation) 
nonlinear, 365-396 
(See also Curvilinear regres- 
sion) 
Correlation, origin and development
of measurement of, 322-324 
partial (see Partial correlation) 

progress in discovery of, 321-322 
ratio, 365-376 
simple, 321-364 
Correlation coefficient (see Coeffi- 
cient of correlation) 
Correlation index, 394-396 
Correlation ratio, calculation of, 
368-373 
explained, 365-368
Cournot, A., 12 
Covariance, 405 
Coxe, Tench, 72 
Curve, error, 294-295 
frequency, theoretical significance 
of, 162-166 
Gaussian error, 194, 294-295 
growth vs. frequency, 149
normal, characteristics of, 265 
formula for, 263-267 
method of fitting to sample 
histogram, 299-300 
normal frequency, 232-320 
characteristics of, 265 
formula for, 263-267 
(See also Normal frequency 
curve) 
probability, 254 
of regression, 367 
standard normal, characteristic of,
266-267
formula for, 267
Curve fitting, curvilinear regres-
sions, 376-397
fitting normal curve, 299-300 
fitting trends to time series, 564- 
616 
Curvilinear regression, calculation 
of, 376-394 
correlation index, 394-396 
and analysis of variance, 395-396 
correlation ratio, and analysis of
variance, 373-376
calculation of, 368-373 
work sheet (table), 371 
explained, 365-368 
Curvilinear regression, estimates
based on regression equations, 
388-390 

illustrated, by bivariate scatter 
diagram and fitted curve, 377 
relationship in logarithmic form 
(graph), 378 
relationship in reciprocal form 
(graph), 381 
logarithmic, 377-380 
illustrated (graph), 379 
practical estimates based on 
equation derived, 388-390 
standard error of estimate, cal- 
culated, 390-394 
transformation of problem into
simple linear correlation, 379-
380
parabolic, 383-388 
curve fitted directly, 384 
Doolittle method for solving 
three equations, 384-388 
work sheet (table), 386 
practical estimates based on 
equation derived, 388-390 
standard error of estimate, cal-
culated, 390-394
reciprocal, 381-383 
illustrated (graph), 381 
practical estimates based on 
equation derived, 388-390 
standard error of estimate, cal- 
culated, 390-394 
transformation to simple linear 
correlation, 351-353 
standard error of estimate, 390- 
394 
calculated, 391-393 
differences for three types of 
regression, 393-394 
first-order standard deviation 
used as, 390-391 
summarized with practical esti- 
mates (table), 393 
use of Pearsonian coefficient of 
correlation, in logarithmic 
approach, 379 
in reciprocal approach, 382 
Cycle determination, 637-650
in annual data, 637-642
annual trend analysis (table
and graph), 640
cyclical movements shown
(table), 641
major cycle and cycle with 
residuals (graph), 642 
danger in extrapolating trends, 
648 
major cycle, 641-642 
method of ratios vs. method of 
differences, 648-650 
in monthly data, adjustment re- 
quired, 637, 642-644 
danger in extrapolating trends, 
648 
method of determining cycle 
illustrated, 644-647 
where trend is a second- or third-
degree polynomial, 647-648
work sheet (table), 643 
Cycles, 545, 637-650
analysis by empirical trends, 594-
598
ogive-like, 552 




D 


Data, cumulative vs. noncumulative, 
127-128 
gathering of, 24-51 
construction of questionnaires 
or schedules, 30-42 
rational basis for, 28-29 
sampling, 42-49 
units of description and meas- 
urement, 28-48 
(See also Questionnaires; 
Schedules) 
sources of (see Sources of statisti- 
cal data) 
three types of statistical, 4-6 
De Moivre, A., 242 
Density function, 489 
in description of multivariate 
distributions, 488-490 
Determination of normality (see
Normality) 
Dewey, John, 22 
Directory of Federal Statistical
Agencies, 64
Distribution, of frequency (see Fre- 
quency distributions) 
of probability (see Probability 
distributions) 
symmetrical binomial (see Sym- 
metrical binomial distribu- 
tion) 
Domesday Book, 25 
Doolittle work sheet for curvilinear 
correlation, 384, 386-388 
Doolittle work sheet for curvilinear 
regression, work sheet (table), 
386 




E 


Eddington, Sir Arthur Stanley,
18-19, 21-22
Einstein, Albert, 21 
Empirical trends, 582-598 
analysis of cycles by, 594-598
straight-line and third-degree 
trends with raw data, illus- 
trated (graph), 597 
conclusions from trends de- 
rived, 598 
work sheet for trend and index 
of normal (table), 594 
work sheet for trend values, 
method of finite differences 
(table), 597 
finite differences method for trend 
values, 589-594 
aid for computing finite differ- 
ences at £ = 0 (table), 590 
building up a polynomial
(table), 589 
danger of cumulative error, 
593-594 
maximum cumulated errors 
(table), 593 
work sheet (table), 592 
polynomial, 583-594 
Empirical trends, polynomial, econ-
omy of calculation in work
sheet, 583-589
economical work sheet, alge-
braic illustration (table),
585
economical work sheet, arith- 
metical illustration (table), 
586 
work sheet for second-degree 
polynomial (table), 588 
straight-line trend, 582-583 
work sheet for index of normal 
and trend, 582 
Enumeration, districts, U.S. census, 
35 
problems of, 28-42 
Enumerators, directions to, 35-39, 
42, 44 
training of, 30, 42, 44 
typical problems facing, 29 
Equiprobability, ellipsoids of, 489- 
490 
Error curve, 294-295
Error, standard, of estimate, 383
for statistics where sampling
distribution approximates
normal curve (table), 320
Estadistica, 90 
Estimates, manufacturers, 7 
Euler, L., 242 
Extrapolation of trends, danger in, 
648 
F 
Federal Reserve Bank of New York, 
517 
Monthly Review of Credit and 
Business Conditions, 534 
Federal Reserve Board, 500, 532 
index, 533 
Federal Reserve Bulletin, 86, 531-532 
Federal Reserve System, 512 
Federal statistical agencies, 71-84 
(See also Sources of statistical 
data) 
Financial statistics, sources of (see 
Sources of statistical data, 
financial) 
Finite differences, method of finding
trend values, 589-594 
danger of cumulative error, 
593-594 
First-order standard deviation, de- 
finition, 390 
Fisher, Irving, 518, 530 
Forecasting, 651-680
agencies, 661-662
Babson, 661, 671-672
Brookmire Economic Society,
661, 671-672
Harvard Economic Society,
661, 671-673
Moody's Investor's Service,
661, 671-672
Standard & Poor's Corporation,
661, 671-672
ancient origin of pseudo-scientific,
651
combined seasonal and cyclical,
677-678
illustrations of, 678
commercial uses of, 661-662
cycles with time series, 661-675
general business conditions,
663-673
business barometer, 664, 666,
670
combination indexes, 666-
667, 670
crosscut analysis method,
664-665, 671-673
historical analogy method,
663-671
indexes of national-income,
667-670
indexes of physical volume of
production (Babson index),
670-671
lead-lag difficulties in fore-
casting, 670
types of indexes, 664-673
particular lines of activity, 673-
675
crosscut analysis method, 675
crude historical analogy
method, illustrated, 673


Forecasting, cycles with time series, 
particular lines of activity, 
cycle hypothesis for, 674-675 
lead-lag relationships, 674 
from distribution studies, 656-661 
bivariate distributions, 657-658 
errors of forecasts, 659 
monovariate distributions, 656-
657
multivariate distributions, 658-
659
modern scientific, 652-656 
conditional, 653-654 
illustrations of, 654-656 
popular dramatization of fore- 
casts, 652-653 
qualitative vs. quantitative, 654 
use of statistics in, 656 
quality and effect of economic 
forecasting, 679-680 
with seasonal variation, 675-678 
historical analogy, 676-677 
trends with time series, less exact
forecasting, 660-661
more exact forecasting, 659-660 
Foreign trade statistics, sources of, 77 
Fourier’s theorem, 561 
Fréchet, Maurice, 250 
Frequency concept of probability, 
247-249 
Frequency curves, definition, 163-164
derivation from histograms, 162- 
164 
formulas for, 263-267 
normal (see Normal frequency 
curve) 
uses of, 164-166 
in graduating observed data, 164-
165
as a norm, 165 
in sampling analysis, 165-166 
Frequency distribution analysis,
numerical computation, 199-
231
arithmetic mean, rule for, 214 
averages and variability, diffi- 
culties in locating median and 
mode, 216-217 
Frequency distribution analysis,
numerical computation, beta
coefficients, 216
calculations, 216-227 
average deviation, 220-224 
histogram assumption in 
grouped data, 221-224 
mid-value assumption in 
grouped data, 220-221 
averages and variability, diffi- 
culties in locating median 
and mode, 216-217 
coefficients, of skewness, 226- 
227 
of variability, 225-226 
measures of skewness, 225 
median and quartiles, 218-220 
mode, 217-218 
semiquartile range, 224-225 
construction of class interval, 
199-206 
determining the class inter- 
val, effect of too many 
intervals, 202 
interval size chosen to re- 
veal character of varia- 
tion, 202-207 
illustrative material, distribu-
tion with various class inter-
vals (tables), 203-205
mean square deviation, 215 
moments about the arbitrary 
origin, 207, 211-212 
moments about the arithmetic 
mean, 212-214 
scatter diagram and graph, 201 
standard deviation, 215-216 
with unequal class intervals, 
228-230
variability and skewness, graph- 
ic interpretation of, 227-228 
work sheet, 206-216 
Charlier check for, 209 
entering the distribution, 208 
illustrated (table), 22è
saving calculation, by obtain-
ing moments about an
arbitrary origin, 207


Frequency distribution analysis,
numerical computation, work
sheet, saving calculation,
in use of work sheet, 208-
209
by using class-interval units, 
207-208 
theory, 158-198 
averages, 167-180 
arithmetic mean, 167-170
concept of, as summary fig-
ures, 179-180
geometric mean, 173-176
harmonic mean, 176-179
median, definition, 170-172 
theory, mode, definition, 172- 
173 
basic formulas used in, sum- 
mary of, 198 
beta coefficients, 192-193, 195 
bivariate (see Bivariate fre- 
quency distribution) 
charts of, 162-164 
histograms, 162-163 
area histograms, 162 
relative frequencies in, 162- 
163 
determination of normality of, 
297-306 
frequency curves (see Frequency 
curves) 
kurtosis of, 193-196 
measurements of summarization 
and comparison, 166-182 
measures of variability, average 
deviation, 183-184 
quartiles of, 171-172 
range, 182-183 
standard deviation, 184-185 
variance, 166, 185 
moments of, 180-182 
the centroid, 181 
moment coefficient, 180 
purpose of, 181-182 
multivariate (see Multivariate
frequency distribution)
of populations, 166-167 
parameters of, 167 
Frequency distribution analysis,
theory, possible types of com- 
parison, 159-162 

as probability distributions, 254 
of sample data, 167 
statistics of, 167 
sampling distributions (see
Sampling distributions)
skewness of, 185-193 
symbols used in, summary, 197 
trivariate (see Multivariate fre- 
quency distribution, tri- 
variate) 
use of, 158-159 
Frequency distributions, 140-149 
conventional manner of graphing, 
143-144 
discrete vs. continuous, 142-143 
irrational, 156 
nature and illustration of, 140-142 
rational, 155-157 
Frequency polygon, relative slope at 
a given point computed (graph), 
288 
Frequency series, 138-143 
definition, 138-139 
(See also Frequency distribution) 
Frequency surfaces, 469-492 
bivariate, 471-486
bivariate histogram, 469-471
multivariate (density function),
488-491
Frequency table, 143
Functions, compound-interest, 261-
262
discount, 263 
explicit, 255-256 
exponential, declining, 262-263 
rising, 261-262 
functional relationships, 255-257 
hyperbolic (table), 695-696 
implicit, 255-256, 259-260 
joint, 255 
linear, 256 
nonlinear, 256-257 
simple, 257-263 
(See also Simple functions, 
graphs of) 


G


Galton, Sir Francis, 14, 196, 293, 
323-324 
Gaussian error curve, 194, 294-295 
Geological Survey, 533 
Gompertz logistic curve, 554 
Goodness of fit, described, 287 
illustrated (graph), 288 
test of, 300-305
[See also Chi square (χ²) test
of goodness of fit]
Government Publications and Their 
Use, 65 
Graphs, (see also Charts), 
of simple functions, 257-267 
(See also Simple functions, 
graphs of) 
Graunt, John, 65-66 
Growth, curves, 149 
(See also Rational trends) 
explanation of, 553 
Guides to sources, 62-65 
governmental, 64-65 
Directory of Federal Statistical
Agencies, 64
Government Publications and
Their Use, 65
U.S. Government Manual, 64
U.S. Government Publications,
65
nongovernmental statistics, 62-64 
handbooks and general index 
material, 63-64 
magazine indexes, 62-63 
Guilds, early sources of statistics, 
25 


H 

Handbooks, 57 

Harvard College Observatory, 13 

Harvard index, 665-666 

Heisenberg, W., 19-20 

Heisenberg’s uncertainty measure- 
ment, 19-20 

Histograms, in frequency distribu- 
tions, 162-163 
Hollerith, Herman, 55

Hollerith tabulating machines, 55, 
72-73 

Huygens, C., 242 

Hyperbolic functions (table), 695- 
696 

Hyperplane of regression, 490-491 


I 


Index to Business Indexes, An, 63 
Index chart, capital formation, 669 
consumer spending, 668 
Harvard, 665 
Indexes, adjustment to bench marks, 
535-542 
ideal conditions for stratified 
sampling absent, 535-536 
method of adjustment illus- 
trated, 538-542 
monthly indexes adjusted to 
census figures (table), 540- 
541 
reasons for adjustment, 536-538 
of correlation (see Correlation 
index) 
of general business conditions, 
533-535 
Harvard, 665-666 
method of computation illus-
trated, 528
price indexes, aggregative, using 
given-year weights, 529 
of production, 82, 530-531 
quantity indexes, and business 
barometers, 530-535 
stratified sampling in, 530 
of trade and production, 530-
531
computation of weights illus- 
trated, 532-533 
weighted by prices, 529-530 
relative series from time series 
(chart), 130 
U.S. Bureau of Labor Statistics, 
construction of indexes, 525- 
527 


Index numbers, 497-542
application of sampling technique, 
513-514 
stratified sampling, 515-516 
composite, 512-513 
stratified sampling in construc- 
tion of, 513 
construction of, aggregative
method, simple, 522-524
weighted, 524-525
average-of-relatives method,
simple, 518-520
weighted, 520-522 
methods in general, 518 
conversion of absolute to relative 
numbers, 500-511 
absolutes, 500-501 
relative parts of a whole, 509- 
511 
relatives, 501-503 
relatives using a base period in 
time series, 503-505 
presumption of normality in 
base selected, 505-509 
history of discovery and use, 497- 
500 
simple, great variety in use, 511- 
512 
variety of purposes of, 516-518 
Industrial statistics, sources of, 66, 
77, 88 
Bureau of Manufactures, 77 
The Economic Almanac, 67 
Industrial Commission, 83 




I.Q., 10

International Statistical Yearbook, 85 

Intuitive-axiomatic approach to 
probability, 250-251 

Iron Age, 68


J 


Journal of the American Statistical 
Association, 534 


K 


King, W. L., 516 
Kolmogoroff, A., 250 
Kurtosis, 162, 193-196
Kuznets, Simon S., law of growth, 
explanation of, 554-558 


L 
Labor statistics, sources of, 61, 74-
75, 84
Commission on Industrial Rela- 
tions, 84 


Department of Labor, 61 
Bureau of Labor Statistics, 78,
86
Monthly Labor Review, 78 
Lagrange, J. L., 242 
Lambert, J. H., 242 
Laplace, P. S., 242-243, 250 
Law of large numbers, 239-240 
League of Nations, 88-91 
indexes, 512 
publications, 512
Least squares, method of, to find 
line of regression in bivariate 
frequency distribution, 331-334 
Legendre, A. M., 242 
Line of regression, 329-335 
becomes hyperplane of regression 
in multivariate distribution, 
490-491 


derived by method of least 
squares, 331-334 

interpretation of, 336-338 

means of rows and columns 


(table), 330 
relationship to r, 349-351 
standard deviation about means 
or line of regression, 336-338 
standard deviations for columns 
of data given (table), 337 
of X₁ on X₂, 330-334
illustrative diagram, 332
of X₂ on X₁, 335
illustrative diagram, 334
Linear plane of regression, second- 
order variances for, 413-416 
Lines of regression, in bivariate fre-
quency distribution, calculated
from given data, after comput-
ing r, 362-363
Lines of regression, work of compu-
tation in fitting to time series
when more than two coeffi- 
cients, 599 

Loci of equiprobability, in multi-
variate frequency "surface," 489

Logarithmic charts (see Ratio charts)

Logarithmic regression, 377-380 

Logarithms, of numbers, four-place 
common (table), 681-684 

scale for ratio charts, 131-137 

Logistic growth curves (see Rational
trends)


M 


Magazine indexes, 62-63 
Market Research Series, 63 
Maximum likelihood, method for 
single best estimate of popula- 
tion percentage in sampling, 
314-315 
Means, progressions of, graph, 328- 
329 
Measurement of General Exchange- 
Value, The, 500 
Minerals Yearbook, 58 
Mises, Richard von, 245-251, 269 
Mitchell, Wesley C., 500, 514, 530, 
553 
Business Cycles—The Problem and 
Its Setting, 499, 513, 535
law of growth, explanation of, 553 
Monthly Bulletin of Statistics, 85 
Monthly Labor Review, 78 
Multiple correlation, 397-436
analysis of variance in, 422-429 
coefficient of direct determina- 
tion, 424 
illustrated (diagram), 424 
coefficient of joint determina- 
tion, 425 
coefficient of multiple correla-
tion, 426-428
coefficient of net regression,
424-425
r beta cross-product term, 424-
425
Multiple correlation, analysis of vari-
ance in, r beta cross-product
term, illustrated (diagram), 
424 

residual variance, illustrated 
(diagram), 424 
analysis of variance and causal 
relationships, 428-429 
coefficient of, 398, 416-418 
definition, 397-399 
extended to any number of vari- 
ables, 434-436 
extension of formulas, high- 
order variances, 436 
multiple-correlation formulas,
436
partial-correlation formulas,
436
statistics for regression
planes, 435-436
general approaches, 434-436 
extension of analysis to four 
variables, 429-434 
multiple correlation coefficient, 
434 
partial correlation in four-vari- 
able case, 433-434 
in terms of correlation statistics 
of same order, 432-433 
in terms of lower-order correla- 
tion statistics of same kind, 
431-432 
in terms of lower-order r's and
σ's, 432
third-order variance, 434 
linear vs. nonlinear relationships, 
399-400 
notation used in, 401-404
meaning of subscripts before 
and after point, 402 
symmetry of, 404 
partial correlation (see Partial 
correlation) 
Multiple linear regression, 410-416 
beta form of regression equation, 
410-413 
obtained by method of least
squares, 410


Multiple linear regression, beta form
of regression equation, a's and 
b’s calculated from beta form, 
412-413 

second-order variances for linear 
plane of regression, 413-416 
Multivariate frequency distribution, 
404-410 
analysis, illustrated, 437-468 
trivariate statistics, interpreta- 
tion of results, illustrated, 
analysis of variance in X, 
451, 455 
estimates based on regression 
equation, 450-451 
partial-correlation coeffi-
cients, 451
best approaches in studying, 409- 
410 
calculation of trivariate statistics, 
444-450 
all-round check on, 450 
equations of three planes of 
regression, as found, 448-449 
first-order correlation statistics, 
445-450 
a statistics, 448
b statistics from the beta's, 448
coefficients of partial correla- 
tion, 447-448 
first-order beta’s from zero- 
order r's (table), 446 
interpretation of results, illus- 
trated, 450-451, 455 
multiple-correlation coefficients, 
449-450 
second-order standard devia- 
tions, 449 
zero-order correlation statistics, 
444-445 
examination of, 437, 442-444 
by testing net regression, 437, 
442-443 
trivariate, 405-410 
conditions for independence of 
all variables, 408-409 
illustrated (diagram), 405 
Multivariate frequency distribution,
trivariate, studied by breaking
up into bivariate distributions,
406-408

trivariate analysis illustrated, cor- 
relation tables of, 406-407 
correlation table of, X₁ and X₂,
406
X₁ and X₃, 406
X₂ and X₃, 407
Multivariate frequency surface, non-
normal distribution, 491-492
normal, 488-491 
deviations normally distributed, 
490 
effect of greater correlation on 
ellipsoid shape, 490 
ellipsoids of equiprobability, 
489-491 
illustrated (graph), 490 
in reality a density function, 489 


N 


National Bureau of Economic Re- 
search, 67 
National Industrial Conference 
Board, 66 
National Research Planning Board, 
publications, 84 
National Resources Committee, 48 
New Jersey State Labor Dept., 538 
New York Times, The, 534 
Newtonian mechanics, 19-21 
Neyman, N., 236, 250 
Nonlinear correlation, 365-396 
(See also Curvilinear regression) 
Nonnormality, in bivariate or multi- 
variate distributions, 491-492 
of population in sampling, 316 
Normal frequency curve, 232-320 
algebraic and graphic representa- 
tion of, 263-267 
algebraic formula, 264 
graph, 263 
graphs of curves with different 
means and same standard 
deviations, 265


Normal frequency curve, algebraic
and graphic representation of,
graphs of curves with same 
mean and different standard 
deviations, 266 

areas under (table), 693 

fitted to histogram of given data 
(graph), 295 

method of fitting to sample histo-
gram, 299-300

ordinates of (table), 694 

real life conditions producing, 
290-297 

recurrence in statistical analysis,
264 

standard normal curve (graph), 
267 

and symmetrical binomial distri- 
bution, 279-306 

use in theory of sampling, 264 

useful approximation to binomial 
distribution where N is large, 
308 

Normal frequency surface, 469-496 
(See also Bivariate frequency
surface; Multivariate fre-
quency surface)

Normal probability curve, (see Nor- 
mal frequency curve) 

Normality, determination of, in bi-
nomial distribution, 297-306
by comparison of special statis-
tics, 305-306
in frequency distributions, 297-
306
by graphic comparison, 298-300
fitting normal curve to sam-
ple histogram, method of,
299-300
by test for goodness of fit (see 
Goodness of fit, test of) 
in time-series, indexes, 505-509 
of population, in sampling, 316- 
317 


O


Order, of correlation coefficients, 422 
of correlation statistics, 422 
Order, designation in correlation sta-
tistics, indicates combination of
variables, 422 

of regression statistics, 422 
of standard deviations, 422 
of variance, 363-364, 414-416 
Orthogonal-polynomial trends, 599- 
616 
calculation of coefficients A, B,
C, . . . , 606-607
by subtotal summation type of 
work sheet, 608-612 
orthogonal polynomials, defini- 
tion, 600-601 
forms used in fitting trends, 
603-606 
tables to save calculation, 612-615 
values of specified variables, 
dependent on number of 
years (tables), 613-615 
trend line by method of least 
squares, 601-603 
uses in trend analysis, 599-600 


P 


Parabolic regression, 383-388 
Parameter, definition of, 167 
Partial correlation, coefficient of, 
418-422 
calculation, 420-422 
definition, 400-401
notation used in, 401-404
obtained between two variables 
by holding third variable con- 
stant, 419-420 
Pascal, B., 242 
Pearl-Reed population curve, 549 
Pearson, Karl, 66, 293-294, 323-
325, 339
Pearson-Galton apparatus for bi- 
nomial distribution, 293-294 
illustrated, 293 
Pearsonian coefficient of correlation, 
338-349 
arithmetic view of r, 339-347
Pepin the Short, 24 
Percentage, population percentage, 
313-315 


Periodogram, 561-562 
Permutations, defined and illus- 
trated, 232-233 
Persons, Warren M., 66, 530, 623 
Petty, Sir William, 65-66 
Pictograms, 102-103 
Planck, Max, 19 
Playfair, William, 100-101 
Polling agencies, 6 
Polynomials, definition, 256 
first-degree, graph of, 257-258 
implicit, 255, 259-260 
second-degree, graph of, 258-259 
Population, curves, 548-549 
laws of growth, 549-550 
technical term in frequency dis- 
tribution, 166 
theories, early, 549-550 
Prescott, Raymond B., 554 
Presentation of statistics, 92-121 
cartograms, 112-121 
(See also Cartograms) 
charts, 100-121 
(See also Charts) 
tables, 92-100 
Probability, combinations, 233-236 
concepts of, 236 
classical, 242-243 
criticism of classical concept, 
243-247 
meaning of “equally likely,” 
244 
principle of indifference, 244- 
245 
principle of sufficient reason, 
245-246 
subjective character of, 246- 
247 
frequency concept, 247 
criticism of von Mises’ 
theory, 247-250 
intuitive-axiomatic approach,
250-251 
curve, formulas for, 263-267 
(See also Frequency curves; 
Normal frequency curve) 
definition, 237 
dependent, 271-272 
Probability, empirically determined,
240-241 
independent, 270-271 
independent vs. dependent illus- 
trated in real life case, 273-274 
law of large numbers, 239-240 
permutations, 232-233 
of possible combinations of 10 
coins (table), 282 
randomness, 241-242
and relative frequency of actual 
events, 239-242 
Probability calculus, 268-278 
addition theorem, 268-269 
for dependent probabilities, 271-
272
examples of calculation, for dis- 
crete distributions, 274-276 
for continuous distribution, 
276-278 


independence vs. dependence illustrated
in real life case, 273-274
for independent probability, 270- 
271 
multiplication theorem, 269-274 
statement, 269 
Probability distributions, 252-267 
continuous, 253-254 
discrete, 253 
functional relationships in (see 
Functions) 
identical with certain types of 
frequency distribution, 254 
probability curve, 254 
(See also Normal frequency 
curve) 

Probability sets, calculation of “derived”
or “second-order” sets, 269ff.

finite, multiplication theorem valid
for, 272
fundamental, 237-238 
infinite, 238-239 
multiplication theorem valid 
for, 272 
Problem of Estimation, The, 500 


Product deviation, measurement of, 
339-347 
Product-moment coefficient of correlation,
339ff.
Product-moment formula for r, use 
in nonnormal frequency distributions,
491-492
Product term, definition, 485 
disappears where correlation is 
absent, 485 
Public opinion, sampling of, 6 
Publications, statistical (see Statisti- 
cal publications) 


Q 


Quality control, 18, 248 
Quantum theory, 19-21 
Quartiles, calculation of, 218-220 
definition, 171-172 
interpretation, 227-228 
use in measuring skewness, 189- 
191 
Questionnaires, mailed, 48-49 
good-will letter used in support 
of (typical form), 50 
rules for constructing, 49, 51 
(See also Schedules) 
Quételet, A., 87, 499, 513, 549-550 


R 


Ratio charts, 131-137 
advantages and disadvantages of, 
135-137 
paper used for, 133-135 
relative growth shown on, 131, 
136-137 
three scales of paper used for, 
134-135 
value for comparisons impossible 
on arithmetic paper, 136-137 
Ratio scale (see Semilogarithmic
paper)
Rational trends, 547-558, 574-581 
dying institution, illustrated, 574-
577
possible trends in dying insti- 
tution, 574 


Rational trends, dying institution,
illustrated, trend fitted by 
method of least squares 
(graph), 577 

work sheet for annual index 
of normal and trend 
(table), 576 
growing institution, illustrated, 
578-581 
curve fitted by method of 
selected points (graph), 581 
method of selected points, 
578-581 
work sheet for index of nor- 
mal and trend (table), 580 
Reciprocal regression, 381-383 
Reciprocals of numbers (table), 691-
692
Regression, linear plane of, 397-399, 
410-413 
(See also Linear plane of
regression)
logarithmic, 377-380
multiple linear, 410-416
parabolic, 383-388
reciprocal, 381-383
statistics, order in, 422 

Relative frequency, probability, 280 

Relativity theory, 21 

Research associations, 66-68 

Review of Economic Statistics, 66,
500, 530


S


Sampling, 42-48 
by Bureau of Labor Statistics, 
42-48 
fitting of normal curve to sample 
histogram, 299-300
in government study of family 
income and expenditures, 48 
of means, 315-319 
population mean, confidence 
limits for, 318 
estimate of, 318 
testing a hypothesis about, 
317-318 
sampling distribution, 315-316 
Sampling, testing a hypothesis, 317-
318
in 1940 U.S. Census, 42 
of percentages, 307-315 
coefficient of risk, 310-311 
confidence coefficient, 311-312 
confidence interval, 313 
population percentage, determining
confidence limits for, 311-313

likelihood of, defined, 315 

likelihood of, relation to 
probability of sample (dia- 
gram), 314 

maximum likelihood, estimate 
of, 313-315 

testing hypothesis about, 
309-311 


sampling distribution, 307-309 
statistical inferences from, 309- 
315 
types of inference, 309 
typical problem, 307 
random, 241-242, 307 
relative frequency of samples 
follows binomial distribution 
pattern, 308 
sampling distributions, (see Sam- 
pling distributions) 
standard errors for selected statis- 
tics, where distribution ap- 
proximates normal curve 
(table), 320 
stratified, in construction of index 
numbers, 515-516 
use of normal frequency curve in, 
307-320
conclusions as to, 319-320 
used in business, 9 
of variances, population variance, 
316 
confidence limits for, 319 
optimum estimate of, 319 
testing a hypothesis about, 
318-319 
sampling distribution, 315-316 
standard deviation, the, 317 
Sampling distributions, 307-317
explained, 307-308
of sample means, 315-316 
of sample percentages, 307-308 
of sample variance, 316-317 
Schedules, coding of, 53-55 
editing of, 52-53 
mailed questionnaires (see Ques- 
tionnaires, mailed) 
problems of enumeration, 28-42 
questionnaires (see also Question- 
naires) 
tabulation of, 55 
(See also Tables; Tabulation) 
units of description and measure- 
ment, 28-34 
illustrations of government care 
in, 35-48 
Seasonal variation, 617-636 
causes of, 618-621 
historical background of study, 
617-618 
in labor, McCabe, 620 
measurement illustrated, 625-633 
calculation by 12 months’ mov- 
ing average method, 625, 
630-633 
completed index (table), 631 
multiple frequency array, deter- 
minations from, 633 
illustrated (graphs), 631-632 
work sheet for calculating index 
(tables), 626-630 
method of detecting change in, 
633-636 
computation of index for single 
year (table), 636 
index required for each year 
because of observable trend, 
636 
trends in seasonal variation 
illustrated (graphs), 634-635 
methods of measuring, 621-625 
Kemmerer, 623 
Persons, 623 
problem of isolating, 621-625 
link relative method, 623 
Seasonal variation, problem of iso-
lating, ratio-difference-from- 
trend method, 624 

twelve months’ moving average 
method, 624-625 
various suggested methods, 
bibliography for, 624n 
testing whether well defined, 632 
trend in, 634-635 
Second-order, indicates statistic with 
two figures to right of decimal 
in subscript, 422
Semilogarithmic paper, 133-135
Series, bivariate, 149-154 
frequency, 139-149 

Sheppard's correction, 299 

Shewhart, W. A., 248 

Significant figures, meaning of, 230-
231
Simple correlation, 351-353 
lines of regression, calculated, 
362-363 
Simple functions, graphs of, 257-267 
circle, 259-260 
ellipse, 260 
exponential function, declining, 
262-263 
rising, 261-262 
first-degree polynomials, 257- 
258 
normal frequency curve, 262-267
second-degree polynomials, 258-
259
Simpson, C. G., 14, 242 
Single best estimate, of population 
percentage in sampling, 314-315 
Skewness, definition and significance 
of, 185-193 
measurement of, by beta coeffi- 
cients, 192-193 
by medians and quartiles, 189-
191
by relation of mean, median, 
and mode, 185-189 
by third moment, 191-192 

Smith, Adam, 11-12 

Smithsonian Institution, 13 

Social Science Research Council, 48 
Social Security Administration, 537
Sources of statistical data, 56-91 
(See also Guides to sources of 
statistics) 
agricultural, (see Agricultural sta- 
tistics) 
banking, (see Banking statistics, 
sources of) 
commercial, 68-71 
commercial and financial publi- 
cations, 70-71 
trade associations, 69-70 
trade journals, 68-69 
federal, 71-84
Congressional investigations, 82 
financial, 70 
commercial and financial publi- 
cations, 70-71 
general summary, developing pat- 
tern of sources, 59-62 
for social sciences, 57-58 
guides to (see Guides to sources)
industrial, (see Industrial statis-
tics, sources of)
international, 91 
on labor, (see Labor statistics, 
sources of) 
pattern of existing (outline), 61-62 
primary vs. secondary, 56 
private research, individuals, 61, 
65-66 
handbooks on, 63 
research associations, 66-68 
(See also Statistical publica-
tions)
state and municipal, 84-85 
on trade, (see Trade) 
on transportation and com- 
munication, (see Transporta- 
tion and communication 
statistics) 
world statistics, 85-91 
best sources of, 91 
Split-bar charts, 110-111 
Square roots of numbers, 100-1000 
table, 689-690 
Squares, 100-990 
of numbers (table), 685-686 


Standard deviation, first-order, 338 
from lines of regression, 336-338 
zero-order, 338 

Standard error, of variance, 316 
of estimate, 338 

definition, 390 
Standard Industrial Classification
Code, 54

Statesman’s Yearbook, 86 

Statistic, definition of a, 167 

Statistical Abstract of the United
States, 64

Statistical Atlas, 73 

Statistical data (see Data)
gathering of (see Data, gathering
of)

Statistical laws, 19 

Statistical publications, abstracting
agencies, 57n
world statistics, 85-91 
(See also Guides to sources) 
Statistics, accuracy in calculating, 
230-231 
in the arts and sciences, 1-23 
in astronomy, 12-13 
in biology, 14-16 
in business administration, 7-9 
definition and meaning of, 1-4 
descriptive, 232 
in economic theory, 11-12
in education, 9-11
in engineering, 16-18 
forecasting by means of, 651-680 
gathering of, 24-55 
historical development in, 24-28 
(See also Data, gathering of) 
in governmental administration, 
6-7 
in medicine, 16 
and philosophy, 21-22 
in physics and chemistry, 18-21
in politics, 6
presentation of (see Presentation 
of statistics) 
in sociology, 11 
sources of (see Sources of statisti- 
cal data) 
Statistics, summarization and com-
parison by means of fre- 
quency distributions, 158- 
198 

by means of index numbers, 
497-542 
measurements for, 166-182 
symbols used in (see Symbols used 
in statistics) 
theoretical, definition, 232 
by use of index numbers, 497ff. 
in zoology, 13-14
Summarization and comparison, in 
bivariate frequency distribu- 
tions, 327-353 
measurements of, 166-182 
Survey of Current Business, 77, 512,
525, 531, 534
Symbols used in statistics, 122-129 
basic symbols, 122-124 
multiple and partial correlation, 
401-404
time series, 124-126 
passage of time, 124-125 
units involved, 126-127 
where variable fluctuates with 
time, 125-126 
Symmetrical binomial distribution, 
character of, 283-285 
mean, 283-284 
moments, 285 
symmetry, 283 
variance, 284-285
derivation, 280-283 
graph of, 284 
and the normal curve, 279-306 
beta values approach those of 
normal curve, 289 
distribution approaches normal 
curve as limit, 285-290 
graphic comparison, 288
relative slope of frequency poly- 
gon and normal curve com- 
pared, 289 
real life conditions producing, 
290-297 
summary, 295-297 


Symmetrical binomial distribution, 
relative slope of frequency poly- 
gon computed for a given point 
(graph), 288 

for two values of N (graph), 286 
effect of scale adjustments 
(graph), 287 
seen in relative frequencies, 
282, 291-293 
Symmetry, in frequency surfaces, 
486-487 
in notation for multiple and partial
correlation, 404
writing of equations by, illus- 
trated, 431-432
T 
Tables, 92-100 
construction of, 92-93, 95-96 
general-purpose, 93-94 
special-purpose, 93, 95-97, 100 
types of, illustrated, 94-99 
Tabulation, machine, 55, 72-73
mechanics of, 73
principles of, 92
(See also Tables)

Test of goodness of fit, 300-305 
(See also Chi square (χ²) test of
goodness of fit)

Theorem, Fourier's, 561 

Theory of errors, 294-296

not intended to mask inaccuracy 
of calculation, 230 

Theory of relativity (see Relativity 
theory) 

Third-order statistics, 422 

Time series, analysis of (see Cycle 
determination; Seasonal varia- 
tion; Trend analysis) 

careful description of units in- 
volved, 126 
conventional charting of, 128-129 
cumulative vs. noncumulative 
data, 127-128
elements of variation in, 543-547
cycle, 544-547
long-term growth or trend, 543- 
547 
Time series, elements of variation
in, residual fluctuations, 574 
seasonal variations, 543-547 
hypothetical, showing elements of 
variation (table), 544 
rational basis of analysis of, 543-
563
rational trends, 547-558, 564 
Time-series analysis, development of 
technique for, 560-563 
harmonic (periodogram) analysis,
561-562
major cycle determination by 
Kuznets' methods, 561 
ordinary and minor cycle determination
by empirical
methods, 561-562
use of functions of arc tangent,
562
use of orthogonal polynomials, 
562, 599-616 
empirical trends, 558-560 
application to cycle analysis, 
558-560 
empirical vs. rational trends, 564 
rational basis for, 543-563 
rational trends, application to 
social philosophy, 553-558 
basis for rationalizing, 550-552 
criticism of, 552-553 
early population theories, 549-
550
historical background, 547-548
population curves, 548-549 
(See also Cycle determination;
Seasonal variation; Trend
analysis)
Trade, Department of Commerce, 86 
Commerce Yearbook, 86 
domestic, 81
Federal Trade Commission, 81 
foreign, 77, 82, 86 
Bureau of Foreign and Domes- 
tic Commerce, 77 
Statistical Abstract of the 
United States, 77 
Survey of Current Business, 77 
Statesman’s Yearbook, 86 
Trade, U.S. Tariff Commission, 82
Transportation and communication 
statistics, 81 
Interstate Commerce Commis- 
sion, 81 
Treasury Department, 7 
Treatise on Money, J. M. Keynes, 
517 
Trend analysis, 564-616 
detecting cycle by removing em- 
pirical trend, 565 
empirical vs. rational trend, 564 
empirical trends, illustrated, 582- 
598 
analysis of cycles by empirical 
trends, 594-598 
finite differences method for 
finding trend values, 589-594 
polynomials, 583-594 
straight-line trend, 582-583 
methods of fitting trend, 565-574 
by averages, 573
moving averages method,
573-574
by least squares, 565-573
advantages of method, 574
basic method, 565-568
numerical illustration, 568-
569
probability theory not ap-
plied, 570-571
second- or third-degree
curves, 569-570
by selected points, 571-573
orthogonal-polynomial trends (see
Orthogonal-polynomial
trends)
rational trends, illustrated, 574-
581
dying institution, 574-577
growing institution, 578-581
Trends, empirical (see Empirical
trends)
orthogonal-polynomial (see Orth-
ogonal-polynomial trends)
rational (see Rational trends)
Trivariate frequency distribution,
405-410 
Twentieth Century Fund, The, 68 


U 


U.S. Bureau of Census, 42, 48, 53- 
56, 532 
U.S. Census, 30-38, 532 
development of, 71-76 
U.S. Census of Agriculture, 38-42 
U.S. Department of Agriculture, 12, 
524 (see also Agricultural Sta- 
tistics) 
U.S. Department of Commerce, 56, 
504, 512, 524, 531 
Bureau of Census (see U.S.
Bureau of Census)
Bureau of Foreign and Domestic 
Commerce, 56, 76-77, 512, 
535 
Survey of Current Business, 525, 
531, 534 
U.S. Department of Labor, 525 
U.S. Government Manual, 64 
Units, careful description necessary 
in time series, 126-127 
of enumeration, 1
of measurement, 1-2 
Univariate frequency distribution, 
325 


V


Van Buren, President Martin, 72 
Variable attributes, 157 
Variable, continuous, illustrated, 
292-293 
discrete, illustrated, 291-294 
but not integral, illustrated, 292 
integral, illustrated, 291-292 
nonintegral, illustrated, 292 
Variable X, in an array, illustrated 
use of, 139-140 
Variability, 182-185 
average deviation, 183-184 
range, 182-183 


Variability, standard deviation, 184- 
185 
universal condition facing scien- 
tist, 122 
Variance, calculation of, 215 
definition of, 185 
first-order, calculated, 363-364 
defined, 338 
meaning of, 336-338 
relation to r, 351-352 
proportion measured, by square of 
correlation coefficient, 353 
by square of correlation index, 
396 
by square of correlation ratio, 
373-376 
sampling distribution of, 316 
second order, for linear plane of 
regression, 413-416 
third order, 434 
Variation, frequency series, 138-143 
static, frequency distribution as 
tool for analysis, 158 